2025-05-07T20:22:34.9159694Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9165976Z Runner name: 'i-0a11e2b4e0c9387f6'
2025-05-07T20:22:34.9166904Z Machine name: 'ip-10-0-64-8'
2025-05-07T20:22:34.9169577Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9171841Z Contents: read
2025-05-07T20:22:34.9172433Z Metadata: read
2025-05-07T20:22:34.9172917Z Packages: read
2025-05-07T20:22:34.9173406Z ##[endgroup]
2025-05-07T20:22:34.9175329Z Secret source: None
2025-05-07T20:22:34.9175943Z Prepare workflow directory
2025-05-07T20:22:35.7679180Z Prepare all required actions
2025-05-07T20:22:35.7720774Z Getting action download info
2025-05-07T20:22:35.9704478Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:36.2235513Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.5884248Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:38.1730569Z Getting action download info
2025-05-07T20:22:38.2759756Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:38.5065559Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.8.0, 12.6.3, clang)
2025-05-07T20:22:38.5567912Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.5673714Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.5685181Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.5685834Z ##[endgroup]
2025-05-07T20:22:39.5929575Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.5929976Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.5930274Z AMI Name: unknown
2025-05-07T20:22:39.5964854Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.9880249Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.9880541Z with:
2025-05-07T20:22:44.9880753Z   submodules: true
2025-05-07T20:22:44.9880992Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.9881371Z   token: ***
2025-05-07T20:22:44.9881575Z   ssh-strict: true
2025-05-07T20:22:44.9881790Z   ssh-user: git
2025-05-07T20:22:44.9882021Z   persist-credentials: true
2025-05-07T20:22:44.9882280Z   clean: true
2025-05-07T20:22:44.9882513Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.9882785Z   fetch-depth: 1
2025-05-07T20:22:44.9883006Z   fetch-tags: false
2025-05-07T20:22:44.9883230Z   show-progress: true
2025-05-07T20:22:44.9883468Z   lfs: false
2025-05-07T20:22:44.9883708Z   set-safe-directory: true
2025-05-07T20:22:44.9883988Z env:
2025-05-07T20:22:44.9884198Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.9884497Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.9884740Z   BUILD_TARGET: genai
2025-05-07T20:22:44.9884964Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.9885229Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:44.9885483Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.9885725Z ##[endgroup]
2025-05-07T20:22:45.1024600Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:45.1025839Z ##[group]Getting Git version info
2025-05-07T20:22:45.1026297Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:45.1026910Z [command]/usr/bin/git version
2025-05-07T20:22:45.1027683Z git version 2.47.1
2025-05-07T20:22:45.1053413Z ##[endgroup]
2025-05-07T20:22:45.1075322Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0be5a3c3-c8ad-4d74-96f6-b84f970e55ff' before making global git config changes
2025-05-07T20:22:45.1076244Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:45.1080534Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:45.1116522Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:45.6219036Z ##[group]Initializing the repository
2025-05-07T20:22:45.6224664Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:45.6277425Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:45.6278109Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:45.6278660Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:45.6279044Z hint:
2025-05-07T20:22:45.6279320Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:45.6279656Z hint:
2025-05-07T20:22:45.6279960Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:45.6280490Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:45.6280923Z hint:
2025-05-07T20:22:45.6281127Z hint:   git branch -m <name>
2025-05-07T20:22:45.6281605Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:45.6290071Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.6324395Z ##[endgroup]
2025-05-07T20:22:45.6324825Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:45.6328444Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:45.6360455Z ##[endgroup]
2025-05-07T20:22:45.6360823Z ##[group]Setting up auth
2025-05-07T20:22:45.6366775Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:45.6398033Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:45.6747673Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:45.6780878Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:45.7128653Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.7185243Z ##[endgroup]
2025-05-07T20:22:45.7185647Z ##[group]Fetching the repository
2025-05-07T20:22:45.7192577Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:46.1357325Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:46.1357915Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:46.1384636Z ##[endgroup]
2025-05-07T20:22:46.1385073Z ##[group]Determining the checkout info
2025-05-07T20:22:46.1387670Z ##[endgroup]
2025-05-07T20:22:46.1393716Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:46.1441699Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:46.1480146Z ##[group]Checking out the ref
2025-05-07T20:22:46.1484624Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:46.2581367Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:46.2581755Z
2025-05-07T20:22:46.2582045Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:46.2582717Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:46.2583590Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:46.2583995Z
2025-05-07T20:22:46.2584263Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:46.2584885Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:46.2585250Z
2025-05-07T20:22:46.2585402Z   git switch -c <new-branch-name>
2025-05-07T20:22:46.2585656Z
2025-05-07T20:22:46.2585847Z Or undo this operation with:
2025-05-07T20:22:46.2586085Z
2025-05-07T20:22:46.2586200Z   git switch -
2025-05-07T20:22:46.2586687Z
2025-05-07T20:22:46.2587002Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:46.2587446Z
2025-05-07T20:22:46.2587975Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:46.2595521Z ##[endgroup]
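The checkout above boils down to a handful of git commands: initialize an empty repository, shallow-fetch only the PR's merge commit into a synthetic ref, then check that ref out detached. A minimal sketch of the same sequence, using the repo URL and merge SHA from the fetch line above (the auth header setup is omitted here):

  # Materialize a PR merge commit the way actions/checkout@v4 does (sketch).
  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  git config --local gc.auto 0   # the action disables auto-GC on the work copy
  # Shallow-fetch the single merge commit into a synthetic ref, then detach onto it.
  git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
      origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
  git checkout --force refs/remotes/pull/4066/merge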
2025-05-07T20:22:46.2595987Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:46.2600735Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:46.2650601Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:46.2686895Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:46.2722858Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:46.2754373Z ##[endgroup]
2025-05-07T20:22:46.2754754Z ##[group]Fetching submodules
2025-05-07T20:22:46.2757729Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:46.3108148Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:46.3443831Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:46.3445925Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:46.3449011Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:46.3452479Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:46.3456130Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:46.3459783Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:46.3463037Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:46.3493328Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:46.6930230Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:47.1863441Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:47.5917585Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.6604997Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.9482287Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:49.1911953Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:50.4146154Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:50.4146641Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:50.4627975Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:51.1458541Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:51.1459024Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.4210367Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:52.1150723Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:52.1151164Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:52.2143530Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:53.3465004Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:53.3465465Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:54.0497265Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.8381038Z From https://github.com/google/googletest
2025-05-07T20:22:54.8381497Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.8790984Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.4309527Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.4310032Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.4396031Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:56.1735801Z From https://github.com/nlohmann/json
2025-05-07T20:22:56.1736252Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.2872518Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.2891614Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.3232499Z Entering 'external/asmjit'
2025-05-07T20:22:56.3265387Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.3297224Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.3330127Z Entering 'external/cutlass'
2025-05-07T20:22:56.3361893Z Entering 'external/googletest'
2025-05-07T20:22:56.3393603Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.3426021Z Entering 'external/json'
2025-05-07T20:22:56.3470622Z ##[endgroup]
2025-05-07T20:22:56.3471182Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.3478034Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.3813870Z Entering 'external/asmjit'
2025-05-07T20:22:56.3878872Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.3951865Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.4019657Z Entering 'external/cutlass'
2025-05-07T20:22:56.4093413Z Entering 'external/googletest'
2025-05-07T20:22:56.4160277Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.4227072Z Entering 'external/json'
2025-05-07T20:22:56.4310654Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.4643585Z Entering 'external/asmjit'
2025-05-07T20:22:56.4705643Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.4708318Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.4771548Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.4774558Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.4838677Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.4841247Z Entering 'external/cutlass'
2025-05-07T20:22:56.4907597Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.4910373Z Entering 'external/googletest'
2025-05-07T20:22:56.4970833Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.4973615Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5034398Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.5037028Z Entering 'external/json'
2025-05-07T20:22:56.5096641Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.5189397Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.5519554Z Entering 'external/asmjit'
2025-05-07T20:22:56.5552512Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.5585299Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5617354Z Entering 'external/cutlass'
2025-05-07T20:22:56.5649067Z Entering 'external/googletest'
2025-05-07T20:22:56.5681140Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5715330Z Entering 'external/json'
2025-05-07T20:22:56.5764469Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.6098070Z Entering 'external/asmjit'
2025-05-07T20:22:56.6130425Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.6163267Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.6195602Z Entering 'external/cutlass'
2025-05-07T20:22:56.6228121Z Entering 'external/googletest'
2025-05-07T20:22:56.6260245Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.6292374Z Entering 'external/json'
2025-05-07T20:22:56.6357069Z ##[endgroup]
2025-05-07T20:22:56.6385810Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.6417123Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
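The "persisting credentials" block above works by git URL rewriting: SSH-style submodule remotes are rewritten to HTTPS so that the injected Authorization header (masked as *** in the log) covers every submodule fetch. A minimal sketch of the same idea for a single repository; B64_TOKEN is a placeholder, and the x-access-token encoding is the convention actions/checkout documents rather than something visible in this masked log:

  # Rewrite SSH-style remotes to HTTPS so the header below applies to them too.
  git config --local url.https://github.com/.insteadOf git@github.com:
  # B64_TOKEN would be base64("x-access-token:<GITHUB_TOKEN>") -- placeholder only.
  git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${B64_TOKEN}"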
2025-05-07T20:22:56.6599965Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.6600269Z with:
2025-05-07T20:22:56.6600503Z   name: fbgemm_genai_x86_clang_py3.12_cu12.8.0.whl
2025-05-07T20:22:56.6600826Z   merge-multiple: false
2025-05-07T20:22:56.6601072Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.6601324Z   run-id: 14891846252
2025-05-07T20:22:56.6601527Z env:
2025-05-07T20:22:56.6601739Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.6602032Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.6602282Z   BUILD_TARGET: genai
2025-05-07T20:22:56.6602499Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.6602732Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.6602982Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.6603220Z ##[endgroup]
2025-05-07T20:22:56.8889701Z Downloading single artifact
2025-05-07T20:22:56.9817692Z Preparing to download the following artifacts:
2025-05-07T20:22:56.9818552Z - fbgemm_genai_x86_clang_py3.12_cu12.8.0.whl (ID: 3081397670, Size: 18492313, Expected Digest: sha256:4144078f606f5674fd0d0827aa1139350c9dac781397a58fc2ce2aeb29225152)
2025-05-07T20:22:57.0508653Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-0953d042-3ee9-5e70-b1c3-e8d5865d7dd7/artifacts/59a2b3a78d1811f7aac902bf66636fb399e87795f56f74815a857bec15c93d16.zip
2025-05-07T20:22:57.0510118Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:57.1718207Z (node:66942) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:57.1719191Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.4545631Z SHA256 digest of downloaded artifact is 4144078f606f5674fd0d0827aa1139350c9dac781397a58fc2ce2aeb29225152
2025-05-07T20:22:57.4546262Z Artifact download completed successfully.
2025-05-07T20:22:57.4546592Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.4551701Z Download artifact has finished successfully
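download-artifact@v4 verifies the downloaded zip against the digest recorded at upload time, which is what the "Expected Digest" and "SHA256 digest of downloaded artifact" lines above show. The same check can be reproduced by hand; a minimal sketch, assuming the artifact has already been saved locally as artifact.zip:

  # Verify a downloaded artifact against its recorded SHA256 digest (sketch).
  EXPECTED=4144078f606f5674fd0d0827aa1139350c9dac781397a58fc2ce2aeb29225152
  ACTUAL=$(sha256sum artifact.zip | awk '{print $1}')
  [ "$ACTUAL" = "$EXPECTED" ] || { echo "digest mismatch: $ACTUAL" >&2; exit 1; }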
2025-05-07T20:22:57.4797201Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.4797600Z with:
2025-05-07T20:22:57.4797815Z   driver-version: 570.133.07
2025-05-07T20:22:57.4798055Z env:
2025-05-07T20:22:57.4798273Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.4798575Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.4798810Z   BUILD_TARGET: genai
2025-05-07T20:22:57.4799039Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.4799273Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.4799519Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.4799759Z ##[endgroup]
2025-05-07T20:22:57.4889688Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.4890096Z with:
2025-05-07T20:22:57.4890553Z   timeout_minutes: 10
2025-05-07T20:22:57.4890790Z   max_attempts: 3
2025-05-07T20:22:57.4915356Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more
          # than one of them so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Fail to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!               Off   | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |     ERR!     Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing piece of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:57.4939716Z   retry_wait_seconds: 10
2025-05-07T20:22:57.4939973Z   polling_interval_seconds: 1
2025-05-07T20:22:57.4940237Z   warning_on_retry: true
2025-05-07T20:22:57.4940487Z   continue_on_error: false
2025-05-07T20:22:57.4940729Z env:
2025-05-07T20:22:57.4940949Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.4941254Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.4941498Z   BUILD_TARGET: genai
2025-05-07T20:22:57.4941723Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.4941964Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.4942224Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.4942466Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.4942732Z ##[endgroup]
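nick-fields/retry runs the script above under the timeout and attempt settings printed in the step config. Roughly equivalent plain bash, as a sketch only (the action's real polling and backoff logic is more involved, and install_nvidia.sh is a hypothetical file holding the script above):

  # Approximate the retry step: 3 attempts, 10-minute timeout, 10 s between tries.
  for attempt in 1 2 3; do                       # max_attempts: 3
    timeout 10m bash install_nvidia.sh && break  # timeout_minutes: 10
    [ "$attempt" -eq 3 ] && exit 1               # continue_on_error: false
    sleep 10                                     # retry_wait_seconds: 10
  done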
2025-05-07T20:22:57.5637153Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.5638130Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.5640536Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.9061169Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.9062236Z No packages marked for removal.
2025-05-07T20:22:57.9125665Z Dependencies resolved.
2025-05-07T20:22:57.9135185Z Nothing to do.
2025-05-07T20:22:57.9135599Z Complete!
2025-05-07T20:22:57.9474108Z + install_nvidia_driver_common
2025-05-07T20:22:57.9478266Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.9492362Z + lspci
2025-05-07T20:22:57.9492690Z Before installing NVIDIA driver
2025-05-07T20:22:57.9600345Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.9601079Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.9601649Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.9602173Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.9602666Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.9603200Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.9603753Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.9604322Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.9604732Z + lsmod
2025-05-07T20:22:57.9649456Z Module Size Used by
2025-05-07T20:22:57.9649829Z xt_nat 16384 0
2025-05-07T20:22:57.9650210Z nvidia_modeset 1716224 0
2025-05-07T20:22:57.9650598Z video 65536 1 nvidia_modeset
2025-05-07T20:22:57.9651050Z wmi 36864 1 video
2025-05-07T20:22:57.9651352Z nvidia_uvm 1884160 0
2025-05-07T20:22:57.9651655Z nvidia 11583488 7 nvidia_uvm,nvidia_modeset
2025-05-07T20:22:57.9652103Z drm 602112 1 nvidia
2025-05-07T20:22:57.9652540Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:57.9653035Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:22:57.9653466Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:57.9653745Z veth 36864 0
2025-05-07T20:22:57.9654000Z xt_conntrack 16384 1
2025-05-07T20:22:57.9654257Z nft_chain_nat 16384 3
2025-05-07T20:22:57.9654515Z xt_MASQUERADE 20480 1
2025-05-07T20:22:57.9654837Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.9655186Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:57.9655613Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.9656082Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:57.9656403Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:57.9656701Z xfrm_user 57344 1
2025-05-07T20:22:57.9656961Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:57.9657253Z xt_addrtype 16384 2
2025-05-07T20:22:57.9657525Z nft_compat 20480 4
2025-05-07T20:22:57.9657829Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.9658250Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.9658624Z br_netfilter 36864 0
2025-05-07T20:22:57.9658897Z bridge 323584 1 br_netfilter
2025-05-07T20:22:57.9659197Z stp 16384 1 bridge
2025-05-07T20:22:57.9659492Z llc 16384 2 bridge,stp
2025-05-07T20:22:57.9659779Z overlay 167936 0
2025-05-07T20:22:57.9660037Z tls 135168 0
2025-05-07T20:22:57.9660294Z nls_ascii 16384 1
2025-05-07T20:22:57.9660552Z nls_cp437 20480 1
2025-05-07T20:22:57.9660798Z vfat 24576 1
2025-05-07T20:22:57.9661056Z fat 86016 1 vfat
2025-05-07T20:22:57.9661370Z sunrpc 696320 1
2025-05-07T20:22:57.9661690Z ena 180224 0
2025-05-07T20:22:57.9661942Z i8042 45056 0
2025-05-07T20:22:57.9662201Z serio 28672 3 i8042
2025-05-07T20:22:57.9662478Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:57.9662748Z button 24576 0
2025-05-07T20:22:57.9663010Z sch_fq_codel 20480 17
2025-05-07T20:22:57.9663267Z dm_mod 188416 0
2025-05-07T20:22:57.9663520Z fuse 163840 1
2025-05-07T20:22:57.9663779Z loop 36864 0
2025-05-07T20:22:57.9664028Z configfs 57344 1
2025-05-07T20:22:57.9664292Z dax 45056 1 dm_mod
2025-05-07T20:22:57.9664575Z dmi_sysfs 20480 0
2025-05-07T20:22:57.9665222Z crc32_pclmul 16384 0
2025-05-07T20:22:57.9665493Z crc32c_intel 24576 0
2025-05-07T20:22:57.9665755Z efivarfs 24576 1
2025-05-07T20:22:57.9666096Z + modinfo nvidia
2025-05-07T20:22:57.9668305Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.9668951Z import_ns: DMA_BUF
2025-05-07T20:22:57.9669293Z alias: char-major-195-*
2025-05-07T20:22:57.9669648Z version: 570.133.07
2025-05-07T20:22:57.9669906Z supported: external
2025-05-07T20:22:57.9670158Z license: Dual MIT/GPL
2025-05-07T20:22:57.9670495Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.9670987Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.9671631Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:57.9671972Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.9672350Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.9672719Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.9673041Z depends: i2c-core,drm
2025-05-07T20:22:57.9673307Z retpoline: Y
2025-05-07T20:22:57.9673539Z name: nvidia
2025-05-07T20:22:57.9673904Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.9674539Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.9675157Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.9675580Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.9675895Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:57.9676205Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.9676527Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:57.9676837Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:57.9677224Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:57.9677722Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.9678251Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.9678591Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.9678896Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:57.9679201Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.9679577Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.9679984Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.9680363Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.9680784Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.9681281Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.9681852Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.9682362Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.9682718Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.9683107Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.9683485Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.9683834Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.9684169Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.9684503Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.9684847Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.9685169Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:57.9685520Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.9685900Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.9686243Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:57.9686592Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.9686950Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.9687300Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:57.9687660Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.9689503Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:57.9689812Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.9690151Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.9690479Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.9690803Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.9691143Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.9691502Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.9691864Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:57.9692288Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.9692643Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.9692985Z parm: rm_firmware_active:charp
2025-05-07T20:22:57.9693394Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.9693651Z ++ command -v nvidia-smi
2025-05-07T20:22:57.9693921Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.9694187Z + set +e
2025-05-07T20:22:57.9694508Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:57.9908217Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:57.9908633Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:57.9908962Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:57.9909274Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:57.9909634Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:57.9910187Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:57.9910785Z + set -e
2025-05-07T20:22:57.9911054Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:57.9911578Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:57.9912147Z + post_install_nvidia_driver_common
2025-05-07T20:22:57.9915058Z + sudo modprobe nvidia
2025-05-07T20:22:58.1506539Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:58.1506972Z + lspci
2025-05-07T20:22:58.1507267Z After installing NVIDIA driver
2025-05-07T20:22:58.1622708Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:58.1623384Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:58.1624028Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:58.1624552Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:58.1625051Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:58.1625788Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:58.1626445Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:58.1626933Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:58.1627352Z + lsmod
2025-05-07T20:22:58.1655615Z Module Size Used by
2025-05-07T20:22:58.1656058Z xt_nat 16384 0
2025-05-07T20:22:58.1656450Z nvidia_modeset 1716224 0
2025-05-07T20:22:58.1656842Z video 65536 1 nvidia_modeset
2025-05-07T20:22:58.1657247Z wmi 36864 1 video
2025-05-07T20:22:58.1657524Z nvidia_uvm 1884160 0
2025-05-07T20:22:58.1657947Z nvidia 11583488 7 nvidia_uvm,nvidia_modeset
2025-05-07T20:22:58.1658392Z drm 602112 1 nvidia
2025-05-07T20:22:58.1658798Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:58.1659191Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:22:58.1659538Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:58.1659826Z veth 36864 0
2025-05-07T20:22:58.1660076Z xt_conntrack 16384 1
2025-05-07T20:22:58.1660333Z nft_chain_nat 16384 3
2025-05-07T20:22:58.1660603Z xt_MASQUERADE 20480 1
2025-05-07T20:22:58.1660910Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:58.1661259Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:58.1661915Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:58.1662383Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:58.1662702Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:58.1662999Z xfrm_user 57344 1
2025-05-07T20:22:58.1663269Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:58.1663557Z xt_addrtype 16384 2
2025-05-07T20:22:58.1663816Z nft_compat 20480 4
2025-05-07T20:22:58.1664114Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:58.1664527Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:58.1664910Z br_netfilter 36864 0
2025-05-07T20:22:58.1665189Z bridge 323584 1 br_netfilter
2025-05-07T20:22:58.1665620Z stp 16384 1 bridge
2025-05-07T20:22:58.1665906Z llc 16384 2 bridge,stp
2025-05-07T20:22:58.1666191Z overlay 167936 0
2025-05-07T20:22:58.1666441Z tls 135168 0
2025-05-07T20:22:58.1666692Z nls_ascii 16384 1
2025-05-07T20:22:58.1666946Z nls_cp437 20480 1
2025-05-07T20:22:58.1667186Z vfat 24576 1
2025-05-07T20:22:58.1667435Z fat 86016 1 vfat
2025-05-07T20:22:58.1667700Z sunrpc 696320 1
2025-05-07T20:22:58.1667983Z ena 180224 0
2025-05-07T20:22:58.1668226Z i8042 45056 0
2025-05-07T20:22:58.1668472Z serio 28672 3 i8042
2025-05-07T20:22:58.1668748Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:58.1669011Z button 24576 0
2025-05-07T20:22:58.1669262Z sch_fq_codel 20480 17
2025-05-07T20:22:58.1669519Z dm_mod 188416 0
2025-05-07T20:22:58.1669778Z fuse 163840 1
2025-05-07T20:22:58.1670017Z loop 36864 0
2025-05-07T20:22:58.1670267Z configfs 57344 1
2025-05-07T20:22:58.1670522Z dax 45056 1 dm_mod
2025-05-07T20:22:58.1670799Z dmi_sysfs 20480 0
2025-05-07T20:22:58.1671046Z crc32_pclmul 16384 0
2025-05-07T20:22:58.1671301Z crc32c_intel 24576 0
2025-05-07T20:22:58.1671551Z efivarfs 24576 1
2025-05-07T20:22:58.1671798Z + modinfo nvidia
2025-05-07T20:22:58.1672854Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:58.1673480Z import_ns: DMA_BUF
2025-05-07T20:22:58.1673810Z alias: char-major-195-*
2025-05-07T20:22:58.1674196Z version: 570.133.07
2025-05-07T20:22:58.1674477Z supported: external
2025-05-07T20:22:58.1674729Z license: Dual MIT/GPL
2025-05-07T20:22:58.1675013Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:58.1675366Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:58.1675687Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:58.1675999Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:58.1676342Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:58.1676682Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:58.1677002Z depends: i2c-core,drm
2025-05-07T20:22:58.1677253Z retpoline: Y
2025-05-07T20:22:58.1677473Z name: nvidia
2025-05-07T20:22:58.1677899Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:58.1678544Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:58.1679148Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:58.1679615Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:58.1679919Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:58.1680221Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:58.1680543Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:58.1680841Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:58.1681150Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:58.1681637Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:58.1682033Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:58.1682361Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:58.1682664Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:58.1682969Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:58.1683326Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:58.1683727Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:58.1684106Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:58.1684515Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.1684927Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:58.1685498Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.1685912Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:58.1686249Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:58.1686618Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:58.1686992Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:58.1687324Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:58.1687648Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:58.1687980Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:58.1688299Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:58.1688610Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:58.1688957Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:58.1689316Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:58.1689641Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:58.1689985Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:58.1690334Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:58.1690665Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:58.1691015Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:58.1691348Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:58.1691631Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:58.1692022Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:58.1692353Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:58.1692664Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:58.1693000Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:58.1693358Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:58.1693710Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:58.1694031Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:58.1694379Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:58.1694726Z parm: rm_firmware_active:charp
2025-05-07T20:22:58.1695006Z + set +e
2025-05-07T20:22:58.1695200Z + nvidia-smi
2025-05-07T20:22:58.1851428Z Wed May  7 20:22:58 2025
2025-05-07T20:22:58.1852006Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:22:58.1852711Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:22:58.1853294Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:22:58.1853795Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:22:58.1854346Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:22:58.1854927Z |                                         |                        |               MIG M. |
2025-05-07T20:22:58.1855270Z |=========================================+========================+======================|
2025-05-07T20:22:58.1990022Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:22:58.1990866Z |  0%   28C    P8             10W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:22:58.1991399Z |                                         |                        |                  N/A |
2025-05-07T20:22:58.1991809Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:22:58.1994752Z
2025-05-07T20:22:58.1995331Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:22:58.1995917Z | Processes:                                                                              |
2025-05-07T20:22:58.1996363Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:22:58.1996953Z |        ID   ID                                                               Usage      |
2025-05-07T20:22:58.1997306Z |=========================================================================================|
2025-05-07T20:22:58.1999705Z |  No running processes found                                                             |
2025-05-07T20:22:58.2000366Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:22:58.4643220Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:22:58.4808992Z NVIDIA A10G
2025-05-07T20:22:58.4850053Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:58.4851022Z + '[' 0 -eq 0 ']'
2025-05-07T20:22:58.4851321Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:22:58.4851612Z + set -e
2025-05-07T20:22:58.4851819Z INFO: Ignoring allowed status 0
2025-05-07T20:22:58.4859346Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:22:58.4863197Z + sudo yum install -y yum-utils
2025-05-07T20:22:58.9122622Z Last metadata expiration check: 0:09:03 ago on Wed May  7 20:13:55 2025.
2025-05-07T20:22:58.9366897Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:22:58.9762175Z Dependencies resolved.
2025-05-07T20:22:58.9943895Z Nothing to do.
2025-05-07T20:22:58.9945011Z Complete!
2025-05-07T20:22:59.0333721Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:22:59.0334330Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:22:59.0335204Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:22:59.3495180Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:22:59.4097089Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:22:59.9287126Z nvidia-container-toolkit 14 kB/s | 833 B 00:00
2025-05-07T20:22:59.9532922Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:22:59.9538351Z Package nvidia-container-toolkit-1.16.2-1.x86_64 is already installed.
2025-05-07T20:22:59.9927614Z Dependencies resolved.
2025-05-07T20:23:00.0111358Z Nothing to do.
2025-05-07T20:23:00.0112278Z Complete!
2025-05-07T20:23:00.0499567Z + sudo systemctl restart docker
2025-05-07T20:23:03.5574469Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:03.5774103Z Wed May  7 20:23:03 2025
2025-05-07T20:23:03.5774796Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:03.5775401Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:03.5775894Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:03.5776390Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:03.5776968Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:03.5777410Z |                                         |                        |               MIG M. |
2025-05-07T20:23:03.5778037Z |=========================================+========================+======================|
2025-05-07T20:23:03.5910620Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:03.5911067Z |  0%   28C    P8             11W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:03.5911460Z |                                         |                        |                  N/A |
2025-05-07T20:23:03.5911865Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:03.5915664Z
2025-05-07T20:23:03.5916084Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:03.5916756Z | Processes:                                                                              |
2025-05-07T20:23:03.5917216Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:03.5917639Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:03.5917989Z |=========================================================================================|
2025-05-07T20:23:03.5920729Z |  No running processes found                                                             |
2025-05-07T20:23:03.5921213Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:04.5447750Z Command completed after 1 attempt(s).
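The GPU_FLAG written to GITHUB_ENV by the script above is meant to be spliced into later docker invocations. A minimal smoke test of those exact flags; the CUDA image tag here is an illustrative choice, not one taken from this log:

  # Confirm containers can see the GPU with the flags the job exported (sketch).
  docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all \
      nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi   # should print the A10G table seen above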
2025-05-07T20:23:04.5550179Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:04.5550659Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:04.5564183Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:04.5564566Z env:
2025-05-07T20:23:04.5564807Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:04.5565103Z   BUILD_ENV: build_binary
2025-05-07T20:23:04.5565346Z   BUILD_TARGET: genai
2025-05-07T20:23:04.5565566Z   BUILD_VARIANT: cuda
2025-05-07T20:23:04.5565792Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:04.5566044Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:04.5566342Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:04.5566668Z ##[endgroup]
2025-05-07T20:23:04.8950170Z ################################################################################
2025-05-07T20:23:04.8950515Z # Print System Info
2025-05-07T20:23:04.8950738Z #
2025-05-07T20:23:04.8967073Z # [2025-05-07T20:23:04.896Z] + print_system_info
2025-05-07T20:23:04.8967432Z ################################################################################
2025-05-07T20:23:04.8967651Z
2025-05-07T20:23:04.8967773Z ################################################################################
2025-05-07T20:23:04.8968100Z [INFO] Printing environment variables ...
2025-05-07T20:23:04.8968394Z + printenv
2025-05-07T20:23:04.8968511Z
2025-05-07T20:23:04.8994483Z SHELL=/bin/bash
2025-05-07T20:23:04.8995448Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:04.8996023Z BUILD_VARIANT=cuda
2025-05-07T20:23:04.8996749Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.8997364Z GITHUB_ACTION=__run
2025-05-07T20:23:04.8997652Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:04.8997998Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:04.8998260Z RUNNER_NAME=i-0a11e2b4e0c9387f6
2025-05-07T20:23:04.8998567Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:04.8998872Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:04.8999148Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:04.8999526Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:04.8999981Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:04.9000258Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:04.9000553Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:04.9001487Z ***
2025-05-07T20:23:04.9001681Z LOGNAME=ec2-user
2025-05-07T20:23:04.9001937Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:04.9002200Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:04.9002430Z GITHUB_ACTIONS=true
2025-05-07T20:23:04.9002647Z SYSTEMD_EXEC_PID=55524
2025-05-07T20:23:04.9002929Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:04.9003481Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:04.9004002Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:04.9004288Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:04.9004547Z RUNNER_OS=Linux
2025-05-07T20:23:04.9004764Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:04.9005009Z HOME=/home/ec2-user
2025-05-07T20:23:04.9005595Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:04.9005886Z LANG=C.UTF-8
2025-05-07T20:23:04.9006481Z RUNNER_TRACKING_ID=github_3ac5b347-36c5-4052-8a78-c74e0ef3d3fd
2025-05-07T20:23:04.9006968Z RUNNER_ARCH=X64
2025-05-07T20:23:04.9007250Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:04.9007583Z BUILD_TARGET=genai
2025-05-07T20:23:04.9008125Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9009023Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9009778Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:04.9010492Z INVOCATION_ID=4ac64003978b4062acf61afbbb55318a
2025-05-07T20:23:04.9010833Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:04.9011105Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:04.9011694Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9012413Z BUILD_ENV=build_binary
2025-05-07T20:23:04.9012650Z GITHUB_ACTOR=q10
2025-05-07T20:23:04.9012868Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:04.9013099Z KERN_NAME_LC=linux
2025-05-07T20:23:04.9013326Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:04.9013626Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:04.9013975Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:04.9014226Z USER=ec2-user
2025-05-07T20:23:04.9014454Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:04.9014759Z SHLVL=1
2025-05-07T20:23:04.9014980Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:04.9015288Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:04.9015739Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:04.9016108Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:04.9016349Z KERN_NAME=Linux
2025-05-07T20:23:04.9016575Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:04.9016996Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:04.9017440Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:04.9017711Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:04.9018003Z JOURNAL_STREAM=8:90754
2025-05-07T20:23:04.9018322Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:04.9018689Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:04.9019004Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:04.9019341Z GITHUB_BASE_REF=main
2025-05-07T20:23:04.9019559Z CI=true
2025-05-07T20:23:04.9019771Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:04.9020059Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:04.9020340Z GITHUB_ACTION_REF=
2025-05-07T20:23:04.9020595Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:04.9021220Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9021824Z MACHINE_NAME=x86_64
2025-05-07T20:23:04.9022049Z _=/usr/bin/printenv
2025-05-07T20:23:04.9022191Z
2025-05-07T20:23:04.9022311Z ################################################################################
2025-05-07T20:23:04.9022637Z [INFO] Print ldd version ...
2025-05-07T20:23:04.9022894Z + ldd --version
2025-05-07T20:23:04.9023029Z
2025-05-07T20:23:04.9023134Z ldd (GNU libc) 2.34
2025-05-07T20:23:04.9023408Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:04.9023872Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:04.9024418Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:04.9024883Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:04.9025108Z
2025-05-07T20:23:04.9025235Z ################################################################################
2025-05-07T20:23:04.9025549Z [INFO] Print CPU info ...
2025-05-07T20:23:04.9025794Z + nproc 2025-05-07T20:23:04.9026058Z 2025-05-07T20:23:04.9043860Z 16 2025-05-07T20:23:04.9045796Z 2025-05-07T20:23:04.9046004Z + lscpu 2025-05-07T20:23:04.9046135Z 2025-05-07T20:23:04.9161175Z Architecture: x86_64 2025-05-07T20:23:04.9161573Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:04.9162100Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:04.9162648Z Byte Order: Little Endian 2025-05-07T20:23:04.9163104Z CPU(s): 16 2025-05-07T20:23:04.9163525Z On-line CPU(s) list: 0-15 2025-05-07T20:23:04.9163988Z Vendor ID: AuthenticAMD 2025-05-07T20:23:04.9164460Z Model name: AMD EPYC 7R32 2025-05-07T20:23:04.9164881Z CPU family: 23 2025-05-07T20:23:04.9165513Z Model: 49 2025-05-07T20:23:04.9165947Z Thread(s) per core: 2 2025-05-07T20:23:04.9166357Z Core(s) per socket: 8 2025-05-07T20:23:04.9166759Z Socket(s): 1 2025-05-07T20:23:04.9167064Z Stepping: 0 2025-05-07T20:23:04.9167367Z BogoMIPS: 5600.00 2025-05-07T20:23:04.9169545Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:04.9171716Z Hypervisor vendor: KVM 2025-05-07T20:23:04.9172111Z Virtualization type: full 2025-05-07T20:23:04.9172457Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:04.9172822Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:04.9173180Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:04.9173532Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:04.9173859Z NUMA node(s): 1 2025-05-07T20:23:04.9174159Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:04.9174548Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:04.9174929Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:04.9175294Z Vulnerability L1tf: Not affected 2025-05-07T20:23:04.9175650Z Vulnerability Mds: Not affected 2025-05-07T20:23:04.9176018Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:04.9176387Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:04.9176758Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:04.9177389Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:04.9178126Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:04.9178686Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:04.9179378Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:04.9180403Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:04.9181416Z Vulnerability Srbds: Not affected 2025-05-07T20:23:04.9181978Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:04.9182317Z 2025-05-07T20:23:04.9182448Z + cat /proc/cpuinfo 2025-05-07T20:23:04.9182645Z 2025-05-07T20:23:04.9182854Z processor : 0 2025-05-07T20:23:04.9183385Z vendor_id : AuthenticAMD 2025-05-07T20:23:04.9183703Z cpu family : 23 2025-05-07T20:23:04.9183993Z model : 49 
2025-05-07T20:23:04.9184285Z model name : AMD EPYC 7R32 2025-05-07T20:23:04.9184617Z stepping : 0 2025-05-07T20:23:04.9184907Z microcode : 0x830107f 2025-05-07T20:23:04.9185183Z cpu MHz : 3295.310 2025-05-07T20:23:04.9185391Z cache size : 512 KB 2025-05-07T20:23:04.9185608Z physical id : 0 2025-05-07T20:23:04.9185816Z siblings : 16 2025-05-07T20:23:04.9186021Z core id : 0 2025-05-07T20:23:04.9186215Z cpu cores : 8 2025-05-07T20:23:04.9186415Z apicid : 0 2025-05-07T20:23:04.9186614Z initial apicid : 0 2025-05-07T20:23:04.9186819Z fpu : yes 2025-05-07T20:23:04.9187016Z fpu_exception : yes 2025-05-07T20:23:04.9187235Z cpuid level : 13 2025-05-07T20:23:04.9187436Z wp : yes 2025-05-07T20:23:04.9189617Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:04.9191958Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:04.9192457Z bogomips : 5600.00 2025-05-07T20:23:04.9192674Z TLB size : 3072 4K pages 2025-05-07T20:23:04.9192914Z clflush size : 64 2025-05-07T20:23:04.9193132Z cache_alignment : 64 2025-05-07T20:23:04.9193400Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:04.9193729Z power management: 2025-05-07T20:23:04.9193867Z
[... /proc/cpuinfo records for processors 1-15 elided; identical to processor 0 apart from the processor, core id, apicid, and momentary cpu MHz fields ...]
2025-05-07T20:23:04.9381771Z ################################################################################ 2025-05-07T20:23:04.9382119Z [INFO] Print PCI info ... 2025-05-07T20:23:04.9382377Z + lspci -v 2025-05-07T20:23:04.9382500Z 2025-05-07T20:23:04.9382742Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:04.9383164Z Subsystem: Amazon.com, Inc.
Device 1237 2025-05-07T20:23:04.9383518Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:04.9383745Z 2025-05-07T20:23:04.9383962Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:04.9384382Z Physical Slot: 1 2025-05-07T20:23:04.9384629Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9384857Z 2025-05-07T20:23:04.9385133Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:04.9385617Z Physical Slot: 1 2025-05-07T20:23:04.9385880Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:04.9386131Z 2025-05-07T20:23:04.9386426Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:04.9386924Z Physical Slot: 3 2025-05-07T20:23:04.9387168Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9387645Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:04.9388036Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:04.9388283Z 2025-05-07T20:23:04.9388626Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:04.9389230Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:04.9389540Z Physical Slot: 4 2025-05-07T20:23:04.9389809Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:04.9390215Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:04.9390595Z Capabilities: 2025-05-07T20:23:04.9390878Z Kernel driver in use: nvme 2025-05-07T20:23:04.9391050Z 2025-05-07T20:23:04.9391386Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:04.9391912Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:04.9392297Z Physical Slot: 5 2025-05-07T20:23:04.9392557Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9392936Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:04.9393357Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:04.9393709Z Capabilities: 2025-05-07T20:23:04.9393985Z Kernel driver in use: ena 2025-05-07T20:23:04.9394238Z Kernel modules: ena 2025-05-07T20:23:04.9394383Z 2025-05-07T20:23:04.9394571Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:04.9395033Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:04.9395344Z Physical Slot: 30 2025-05-07T20:23:04.9395619Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:04.9396029Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:04.9396451Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:04.9396857Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:04.9397218Z Capabilities: 2025-05-07T20:23:04.9397503Z Kernel driver in use: nvidia 2025-05-07T20:23:04.9397782Z Kernel modules: nvidia 2025-05-07T20:23:04.9397935Z 2025-05-07T20:23:04.9398287Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:04.9398860Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:04.9399165Z Physical Slot: 31 2025-05-07T20:23:04.9399421Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9399812Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:04.9400221Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:04.9400575Z Capabilities: 2025-05-07T20:23:04.9400861Z Kernel driver in use: nvme 2025-05-07T20:23:04.9401031Z 2025-05-07T20:23:04.9401035Z 2025-05-07T20:23:04.9401153Z ################################################################################ 2025-05-07T20:23:04.9401501Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:04.9401810Z + uname -a 2025-05-07T20:23:04.9401928Z 2025-05-07T20:23:04.9402381Z Linux ip-10-0-64-8.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:04.9402946Z 2025-05-07T20:23:04.9403027Z + uname -m 2025-05-07T20:23:04.9403152Z 2025-05-07T20:23:04.9403226Z x86_64 2025-05-07T20:23:04.9403335Z 2025-05-07T20:23:04.9403429Z + cat /proc/version 2025-05-07T20:23:04.9403566Z 2025-05-07T20:23:04.9404182Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:04.9404949Z 2025-05-07T20:23:04.9405038Z + cat /etc/os-release 2025-05-07T20:23:04.9405196Z 2025-05-07T20:23:04.9405292Z NAME="Amazon Linux" 2025-05-07T20:23:04.9405513Z VERSION="2023" 2025-05-07T20:23:04.9405717Z ID="amzn" 2025-05-07T20:23:04.9405913Z ID_LIKE="fedora" 2025-05-07T20:23:04.9406450Z VERSION_ID="2023" 2025-05-07T20:23:04.9406688Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:04.9406987Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:04.9407291Z ANSI_COLOR="0;33" 2025-05-07T20:23:04.9407550Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:04.9407971Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:04.9408446Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:04.9408898Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:04.9409378Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:04.9409796Z VENDOR_NAME="AWS" 2025-05-07T20:23:04.9410046Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:04.9410351Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:04.9410518Z 2025-05-07T20:23:04.9410784Z ################################################################################ 2025-05-07T20:23:04.9411119Z # Print EC2 Instance Info 2025-05-07T20:23:04.9411362Z # 2025-05-07T20:23:04.9411585Z # [2025-05-07T20:23:04.938Z] + print_ec2_info 2025-05-07T20:23:04.9411916Z ################################################################################ 2025-05-07T20:23:04.9412211Z 2025-05-07T20:23:04.9509537Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:04.9627864Z instance-id: i-0a11e2b4e0c9387f6 2025-05-07T20:23:04.9740546Z instance-type: g5.4xlarge 2025-05-07T20:23:04.9786203Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:04.9786573Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:04.9796419Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:04.9796778Z env: 2025-05-07T20:23:04.9797008Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:04.9797312Z BUILD_ENV: build_binary 2025-05-07T20:23:04.9797563Z BUILD_TARGET: genai 2025-05-07T20:23:04.9797795Z BUILD_VARIANT: cuda 2025-05-07T20:23:04.9798027Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:04.9798287Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:04.9798591Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:04.9798936Z ##[endgroup] 2025-05-07T20:23:05.3170053Z ################################################################################ 2025-05-07T20:23:05.3170419Z [INFO] Printing general display info ... 2025-05-07T20:23:05.3186919Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:05.4387681Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:05.4398138Z /usr/bin/sudo 2025-05-07T20:23:05.4408942Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:05.4418829Z /usr/bin/yum 2025-05-07T20:23:05.4420530Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:05.4442526Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:05.8658866Z Last metadata expiration check: 0:00:06 ago on Wed May 7 20:22:59 2025. 2025-05-07T20:23:05.9479767Z ================================================================================ 2025-05-07T20:23:05.9480443Z WARNING: 2025-05-07T20:23:05.9480953Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:05.9481411Z 2025-05-07T20:23:05.9481606Z Available Versions: 2025-05-07T20:23:05.9481900Z 2025-05-07T20:23:05.9482082Z Version 2023.7.20250331: 2025-05-07T20:23:05.9482712Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:05.9483226Z 2025-05-07T20:23:05.9483497Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:05.9483922Z 2025-05-07T20:23:05.9484110Z Release notes: 2025-05-07T20:23:05.9484845Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:05.9485281Z 2025-05-07T20:23:05.9485375Z Version 2023.7.20250414: 2025-05-07T20:23:05.9485698Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:05.9485957Z 2025-05-07T20:23:05.9486078Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:05.9486299Z 2025-05-07T20:23:05.9486389Z Release notes: 2025-05-07T20:23:05.9487004Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:05.9487380Z 2025-05-07T20:23:05.9487481Z Version 2023.7.20250428: 2025-05-07T20:23:05.9487790Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:05.9488054Z 2025-05-07T20:23:05.9488175Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:05.9488394Z 2025-05-07T20:23:05.9488492Z Release notes: 2025-05-07T20:23:05.9488891Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:05.9489272Z 2025-05-07T20:23:05.9489390Z ================================================================================ 2025-05-07T20:23:06.0633004Z Dependencies resolved. 
2025-05-07T20:23:06.0918193Z ================================================================================ 2025-05-07T20:23:06.0918625Z Package Arch Version Repository Size 2025-05-07T20:23:06.0919012Z ================================================================================ 2025-05-07T20:23:06.0919337Z Upgrading: 2025-05-07T20:23:06.0919707Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:06.0920306Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:06.0920679Z 2025-05-07T20:23:06.0921000Z Transaction Summary 2025-05-07T20:23:06.0921268Z ================================================================================ 2025-05-07T20:23:06.0921594Z Upgrade 2 Packages 2025-05-07T20:23:06.0921732Z 2025-05-07T20:23:06.0921833Z Total download size: 6.9 M 2025-05-07T20:23:06.0922772Z Downloading Packages: 2025-05-07T20:23:06.1332617Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 31 MB/s | 1.2 MB 00:00 2025-05-07T20:23:06.1799323Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 66 MB/s | 5.7 MB 00:00 2025-05-07T20:23:06.1809079Z -------------------------------------------------------------------------------- 2025-05-07T20:23:06.1812060Z Total 78 MB/s | 6.9 MB 00:00 2025-05-07T20:23:06.1814464Z Running transaction check 2025-05-07T20:23:06.1909720Z Transaction check succeeded. 2025-05-07T20:23:06.1910358Z Running transaction test 2025-05-07T20:23:06.2204989Z Transaction test succeeded. 2025-05-07T20:23:06.2208252Z Running transaction 2025-05-07T20:23:06.7722200Z Preparing : 1/1 2025-05-07T20:23:06.8778220Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:06.8801631Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:06.9050753Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:06.9051573Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:06.9154075Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:06.9180717Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:07.0994786Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:07.0995395Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:07.0995961Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:07.0996508Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:07.2472421Z ================================================================================ 2025-05-07T20:23:07.2472797Z WARNING: 2025-05-07T20:23:07.2473047Z A newer release of "Amazon Linux" is available. 
[... remainder of the "newer release available" notice elided; identical to the upgrade notice printed above ...] 2025-05-07T20:23:07.3055857Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:07.3056234Z 2025-05-07T20:23:07.3056324Z Upgraded: 2025-05-07T20:23:07.3056690Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:07.3057290Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:07.3057644Z 2025-05-07T20:23:07.3057742Z Complete! 2025-05-07T20:23:07.3491730Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:07.3515802Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:07.7481990Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:22:59 2025. 2025-05-07T20:23:07.7723679Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:07.8122920Z Dependencies resolved.
2025-05-07T20:23:07.8300100Z ================================================================================ 2025-05-07T20:23:07.8301041Z Package Architecture Version Repository Size 2025-05-07T20:23:07.8301887Z ================================================================================ 2025-05-07T20:23:07.8302477Z Installing: 2025-05-07T20:23:07.8303046Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:07.8303601Z 2025-05-07T20:23:07.8303777Z Transaction Summary 2025-05-07T20:23:07.8304259Z ================================================================================ 2025-05-07T20:23:07.8304847Z Install 1 Package 2025-05-07T20:23:07.8305144Z 2025-05-07T20:23:07.8305266Z Total download size: 319 k 2025-05-07T20:23:07.8305518Z Installed size: 837 k 2025-05-07T20:23:07.8305756Z Downloading Packages: 2025-05-07T20:23:07.9777192Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 2.6 MB/s | 319 kB 00:00 2025-05-07T20:23:07.9782969Z -------------------------------------------------------------------------------- 2025-05-07T20:23:07.9785681Z Total 2.1 MB/s | 319 kB 00:00 2025-05-07T20:23:07.9942933Z Running transaction check 2025-05-07T20:23:07.9997863Z Transaction check succeeded. 2025-05-07T20:23:07.9998246Z Running transaction test 2025-05-07T20:23:08.0447319Z Transaction test succeeded. 2025-05-07T20:23:08.0451050Z Running transaction 2025-05-07T20:23:08.1463798Z Preparing : 1/1 2025-05-07T20:23:08.1969137Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:08.4019228Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:08.5292955Z ================================================================================ 2025-05-07T20:23:08.5293544Z WARNING: 2025-05-07T20:23:08.5293927Z A newer release of "Amazon Linux" is available. 
[... remainder of the "newer release available" notice elided; identical to the upgrade notice printed above ...] 2025-05-07T20:23:08.5643342Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:08.5643873Z 2025-05-07T20:23:08.5644006Z Installed: 2025-05-07T20:23:08.5644429Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:08.5644861Z 2025-05-07T20:23:08.5644985Z Complete! 2025-05-07T20:23:08.6101673Z + hostname 2025-05-07T20:23:08.6101824Z 2025-05-07T20:23:08.6116272Z ip-10-0-64-8.ec2.internal 2025-05-07T20:23:08.6118118Z 2025-05-07T20:23:08.6118483Z + sudo lshw -C display 2025-05-07T20:23:08.6118653Z 2025-05-07T20:23:09.2732061Z *-display:0 UNCLAIMED 2025-05-07T20:23:09.2732399Z description: VGA compatible controller 2025-05-07T20:23:09.2732729Z product: Amazon.com, Inc. 2025-05-07T20:23:09.2733015Z vendor: Amazon.com, Inc.
2025-05-07T20:23:09.2733282Z physical id: 3 2025-05-07T20:23:09.2733524Z bus info: pci@0000:00:03.0 2025-05-07T20:23:09.2733781Z version: 00 2025-05-07T20:23:09.2733998Z width: 32 bits 2025-05-07T20:23:09.2734225Z clock: 33MHz 2025-05-07T20:23:09.2734501Z capabilities: vga_controller bus_master 2025-05-07T20:23:09.2734821Z configuration: latency=0 2025-05-07T20:23:09.2735154Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:09.2735488Z *-display:1 2025-05-07T20:23:09.2735715Z description: 3D controller 2025-05-07T20:23:09.2735994Z product: GA102GL [A10G] 2025-05-07T20:23:09.2736269Z vendor: NVIDIA Corporation 2025-05-07T20:23:09.2736540Z physical id: 1e 2025-05-07T20:23:09.2736781Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:09.2737031Z version: a1 2025-05-07T20:23:09.2737245Z width: 64 bits 2025-05-07T20:23:09.2737469Z clock: 33MHz 2025-05-07T20:23:09.2737756Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:09.2738140Z configuration: driver=nvidia latency=0 2025-05-07T20:23:09.2738777Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:09.2770555Z 2025-05-07T20:23:09.2770770Z ################################################################################ 2025-05-07T20:23:09.2771095Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:09.2900485Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:09.3084096Z Wed May 7 20:23:09 2025 2025-05-07T20:23:09.3084495Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:09.3085165Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:09.3085664Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:09.3086209Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:09.3086744Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:09.3087174Z | | | MIG M. | 2025-05-07T20:23:09.3087518Z |=========================================+========================+======================| 2025-05-07T20:23:09.3219168Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:09.3220000Z | 0% 28C P8 10W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:09.3220397Z | | | N/A | 2025-05-07T20:23:09.3220802Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:09.3223978Z 2025-05-07T20:23:09.3224564Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:09.3225096Z | Processes: | 2025-05-07T20:23:09.3225554Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:09.3225976Z | ID ID Usage | 2025-05-07T20:23:09.3226334Z |=========================================================================================| 2025-05-07T20:23:09.3229135Z | No running processes found | 2025-05-07T20:23:09.3229656Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:09.5902316Z ################################################################################ 2025-05-07T20:23:09.5902760Z [INFO] Printing AMD GPU info ... 
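The [CHECK] lines that follow come from probing for the ROCm userland tools, which are expectedly absent on this CUDA machine; only the NVIDIA stack above is present. A rough sketch of such a which-based probe (the actual logic lives in .github/scripts/setup_env.bash):

  # Hedged sketch: report each ROCm tool as found (and run it) or missing.
  for tool in rocminfo rocm-smi; do
    if which "$tool"; then
      "$tool"
    else
      echo "[CHECK] $tool not found"
    fi
  done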
2025-05-07T20:23:09.6044704Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:09.6045792Z [CHECK] rocminfo not found 2025-05-07T20:23:09.6055195Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:09.6056490Z [CHECK] rocm-smi not found 2025-05-07T20:23:09.6089318Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:09.6089752Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:09.6100842Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:09.6101195Z env: 2025-05-07T20:23:09.6101417Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:09.6101728Z BUILD_ENV: build_binary 2025-05-07T20:23:09.6101977Z BUILD_TARGET: genai 2025-05-07T20:23:09.6102209Z BUILD_VARIANT: cuda 2025-05-07T20:23:09.6102451Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:09.6102712Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:09.6103021Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:09.6103352Z ##[endgroup] 2025-05-07T20:23:09.9464096Z ################################################################################ 2025-05-07T20:23:09.9464443Z # Setup Miniconda 2025-05-07T20:23:09.9464947Z # 2025-05-07T20:23:09.9479148Z # [2025-05-07T20:23:09.947Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:09.9479553Z ################################################################################ 2025-05-07T20:23:09.9479787Z 2025-05-07T20:23:09.9493937Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:10.0478620Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:10.0479002Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:10.0479203Z 2025-05-07T20:23:10.0498732Z 2025-05-07T20:23:10.0499221Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:10.0520774Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:11.4728609Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:11.4729162Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:11.4729480Z 2025-05-07T20:23:11.4873204Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:11.9324091Z Unpacking payload ... 2025-05-07T20:23:12.4512043Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:13.2518811Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:15.3522105Z 2025-05-07T20:23:15.3522627Z Installing base environment... 2025-05-07T20:23:15.3522850Z 2025-05-07T20:23:16.4194472Z Preparing transaction: ...working... done 2025-05-07T20:23:19.2928331Z Executing transaction: ...working... done 2025-05-07T20:23:19.9506922Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.0391164Z installation finished. 2025-05-07T20:23:20.0397931Z 2025-05-07T20:23:20.0398391Z + rm -f miniconda.sh 2025-05-07T20:23:20.0398598Z 2025-05-07T20:23:20.0714519Z 2025-05-07T20:23:20.0714888Z [SETUP] Reloading the bash configuration ... 
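The reload that follows is the standard way to make a fresh Miniconda usable inside the same non-interactive shell: `conda init bash` writes the activation hook into ~/.bashrc, and sourcing that file picks the hook up immediately. A minimal sketch of the non-interactive bootstrap this section performs (the prefix $HOME/miniconda matches the log; the rest is standard installer usage, not copied from setup_env.bash):

    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u   # -b: batch mode (no prompts), -p: install prefix, -u: update an existing install
    rm -f miniconda.sh
    "$HOME/miniconda/bin/conda" init bash          # appends the conda shell hook to ~/.bashrc
    . ~/.bashrc                                    # reload so `conda` resolves in this same shell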
2025-05-07T20:23:20.0715250Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:20.4362575Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:20.4362965Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:20.4363314Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:20.4363671Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:20.4364028Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:20.4364423Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:20.4364851Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:20.4365295Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:20.4365771Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:20.4366537Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:20.4367075Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:20.4367442Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:20.4367843Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:20.5020536Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:21.3303816Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:21.3327224Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:34.4748981Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:36.0540522Z Solving environment: done
2025-05-07T20:23:36.1498085Z ## Package Plan ##
2025-05-07T20:23:36.1498979Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:36.1499344Z   added / updated specs:
2025-05-07T20:23:36.1499624Z     - conda-libmamba-solver
2025-05-07T20:23:36.1499893Z     - libarchive
2025-05-07T20:23:36.1500109Z     - libmamba
2025-05-07T20:23:36.1500327Z     - libmambapy
2025-05-07T20:23:36.1500627Z The following packages will be downloaded:
2025-05-07T20:23:36.1500978Z     package                      |            build
2025-05-07T20:23:36.1501299Z     ---------------------------|-----------------
2025-05-07T20:23:36.1501723Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:36.1502217Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:36.1502669Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:36.1503158Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:36.1503616Z     ------------------------------------------------------------
2025-05-07T20:23:36.1503965Z                                            Total:         1.4 MB
2025-05-07T20:23:36.1504288Z The following packages will be UPDATED:
2025-05-07T20:23:36.1508348Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:36.1509311Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:36.1509940Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:36.1510604Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:36.1511425Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:36.1512084Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:36.4996600Z   (progress bars elided: conda-25.3.1, certifi-2025.4.26, ca-certificates-2025.4.26, and conda-libmamba-solver-25.4.0 all reached 100%)
2025-05-07T20:23:36.6000173Z Preparing transaction: done
2025-05-07T20:23:36.7002711Z Verifying transaction: done
2025-05-07T20:23:38.0021936Z Executing transaction: done
2025-05-07T20:23:39.7173193Z [SETUP] Updating Miniconda base packages ...
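The [EXEC] [ATTEMPT 0/3] prefix on the network-bound commands in this log comes from a retry helper defined in .github/scripts/setup_env.bash. Its implementation is not shown here, so the sketch below is a hypothetical reconstruction (name, attempt count, and backoff are assumptions):

    # Hypothetical retry wrapper; the real helper lives in setup_env.bash.
    exec_with_retries () {
      local max=3 attempt
      for attempt in $(seq 0 "$max"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      return 1
    }

    exec_with_retries conda update -n base -c defaults --update-deps -y conda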
2025-05-07T20:23:39.7198512Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:40.6518261Z Channels:
2025-05-07T20:23:40.6518510Z  - defaults
2025-05-07T20:23:40.6518730Z Platform: linux-64
2025-05-07T20:23:41.8739599Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:41.9937039Z Solving environment: done
2025-05-07T20:23:41.9937486Z Channels:
2025-05-07T20:23:41.9937718Z  - defaults
2025-05-07T20:23:41.9937718Z Platform: linux-64
2025-05-07T20:23:42.2834572Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:42.4939455Z Solving environment: done
2025-05-07T20:23:42.6470954Z ## Package Plan ##
2025-05-07T20:23:42.6471293Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:42.6471639Z   added / updated specs:
2025-05-07T20:23:42.6471899Z     - conda
2025-05-07T20:23:42.6472160Z The following packages will be downloaded:
2025-05-07T20:23:42.6472507Z     package                    |            build
2025-05-07T20:23:42.6472852Z     ---------------------------|-----------------
2025-05-07T20:23:42.6473212Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:42.6473604Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:42.6473982Z     ------------------------------------------------------------
2025-05-07T20:23:42.6474591Z                                            Total:         1.4 MB
2025-05-07T20:23:42.6474936Z The following packages will be UPDATED:
2025-05-07T20:23:42.6475464Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:42.6476005Z   tzdata                              2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:42.6476442Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:42.9116838Z   (progress bars elided: pip-25.1 and tzdata-2025b both reached 100%)
2025-05-07T20:23:43.0121223Z Preparing transaction: done
2025-05-07T20:23:43.1127660Z Verifying transaction: done
2025-05-07T20:23:45.2154196Z Executing transaction: done
2025-05-07T20:23:45.8382925Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:45.8386833Z + conda clean --packages --tarball -y
2025-05-07T20:23:46.8403730Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:46.8404078Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:46.9088187Z + conda clean --all -y
2025-05-07T20:23:47.4468569Z There are no unused tarball(s) to remove.
2025-05-07T20:23:47.4469262Z Will remove 1 index cache(s).
2025-05-07T20:23:47.4469821Z There are no unused package(s) to remove.
2025-05-07T20:23:47.4470431Z There are no tempfile(s) to remove. 2025-05-07T20:23:47.4470999Z There are no logfile(s) to remove. 2025-05-07T20:23:47.5091432Z 2025-05-07T20:23:47.5096701Z + conda info 2025-05-07T20:23:47.5096961Z 2025-05-07T20:23:48.2607221Z 2025-05-07T20:23:48.2607853Z active environment : base 2025-05-07T20:23:48.2608202Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:48.2608528Z shell level : 1 2025-05-07T20:23:48.2608801Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:48.2609213Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:48.2609588Z conda version : 25.3.1 2025-05-07T20:23:48.2609861Z conda-build version : not installed 2025-05-07T20:23:48.2610153Z python version : 3.13.2.final.0 2025-05-07T20:23:48.2610449Z solver : libmamba (default) 2025-05-07T20:23:48.2610751Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:48.2611047Z __conda=25.3.1=0 2025-05-07T20:23:48.2611317Z __cuda=12.8=0 2025-05-07T20:23:48.2611588Z __glibc=2.34=0 2025-05-07T20:23:48.2611868Z __linux=6.1.130=0 2025-05-07T20:23:48.2612195Z __unix=0=0 2025-05-07T20:23:48.2612528Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:48.2612939Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:48.2613287Z conda av metadata url : None 2025-05-07T20:23:48.2613987Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:48.2614424Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:48.2614808Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:48.2615192Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:48.2615567Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:48.2615911Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:48.2616250Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:48.2616594Z /home/ec2-user/.conda/envs 2025-05-07T20:23:48.2616901Z platform : linux-64 2025-05-07T20:23:48.2617743Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:48.2618741Z UID:GID : 1000:1000 2025-05-07T20:23:48.2619027Z netrc file : None 2025-05-07T20:23:48.2619291Z offline mode : False 2025-05-07T20:23:48.2619460Z 2025-05-07T20:23:48.3265584Z 2025-05-07T20:23:48.3266094Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:48.3266872Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_253e7459-3102-448d-a886-17ea95ebc735 ... 2025-05-07T20:23:48.3267676Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:48.3354036Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:23:48.3354526Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:23:48.3373611Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:48.3373962Z env: 2025-05-07T20:23:48.3374188Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:48.3374509Z BUILD_ENV: build_binary 2025-05-07T20:23:48.3374758Z BUILD_TARGET: genai 2025-05-07T20:23:48.3374992Z BUILD_VARIANT: cuda 2025-05-07T20:23:48.3375223Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:48.3375479Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:48.3375781Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:48.3376111Z ##[endgroup] 2025-05-07T20:23:48.6746163Z ################################################################################ 2025-05-07T20:23:48.6746553Z # Create Conda Environment 2025-05-07T20:23:48.6746798Z # 2025-05-07T20:23:48.6761393Z # [2025-05-07T20:23:48.675Z] + create_conda_environment build_binary 3.12 2025-05-07T20:23:48.6761815Z ################################################################################ 2025-05-07T20:23:48.6762033Z 2025-05-07T20:23:48.6778051Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:48.7649440Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:48.7649848Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:48.7650198Z + conda info --envs 2025-05-07T20:23:48.7650351Z 2025-05-07T20:23:49.5163162Z 2025-05-07T20:23:49.5163813Z # conda environments: 2025-05-07T20:23:49.5164092Z # 2025-05-07T20:23:49.5164310Z base /home/ec2-user/miniconda 2025-05-07T20:23:49.5164538Z 2025-05-07T20:23:49.5821643Z 2025-05-07T20:23:49.5822436Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:51.2092965Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:51.2093270Z 2025-05-07T20:23:51.2105787Z 2025-05-07T20:23:51.2116079Z [SETUP] Creating new Conda environment (Python 3.12) ... 
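Once the create step below finishes, a quick way to confirm the interpreter actually landed in the new environment is shown here as a sketch (a sanity check, not a command run by this workflow):

    conda env list                               # build_binary should now appear alongside base
    conda run -n build_binary python --version   # expect: Python 3.12.x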
2025-05-07T20:23:51.2139396Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:23:51.9656888Z Channels:
2025-05-07T20:23:51.9657143Z  - defaults
2025-05-07T20:23:51.9657359Z Platform: linux-64
2025-05-07T20:23:53.5321446Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:53.6580788Z Solving environment: done
2025-05-07T20:23:53.6868285Z ## Package Plan ##
2025-05-07T20:23:53.6868731Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:53.6869318Z   added / updated specs:
2025-05-07T20:23:53.6869610Z     - python=3.12
2025-05-07T20:23:53.6869894Z The following packages will be downloaded:
2025-05-07T20:23:53.6870259Z     package                    |            build
2025-05-07T20:23:53.6870579Z     ---------------------------|-----------------
2025-05-07T20:23:53.6870941Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:23:53.6871348Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:23:53.6871875Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:23:53.6872771Z     python-3.12.9              |       h5148396_0        34.7 MB
2025-05-07T20:23:53.6873181Z     setuptools-78.1.1          |  py312h06a4308_0         2.2 MB
2025-05-07T20:23:53.6873569Z     wheel-0.45.1               |  py312h06a4308_0         147 KB
2025-05-07T20:23:53.6873944Z     ------------------------------------------------------------
2025-05-07T20:23:53.6874287Z                                            Total:        37.2 MB
2025-05-07T20:23:53.6874631Z The following NEW packages will be INSTALLED:
2025-05-07T20:23:53.6875279Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:23:53.6875737Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:23:53.6884697Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:23:53.6885363Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:23:53.6885859Z   expat              pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:23:53.6886327Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:23:53.6886790Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:23:53.6887225Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:23:53.6887665Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:23:53.6888130Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:23:53.6888588Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:23:53.6889011Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:23:53.6889433Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:23:53.6889834Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:53.6890240Z   python             pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:23:53.6890678Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:23:53.6891154Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:23:53.6891618Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:23:53.6892132Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:23:53.6892523Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:23:53.6893016Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:23:53.6893542Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:23:53.6893949Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:23:53.6894355Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:54.8021770Z   (progress bars elided: python-3.12.9, setuptools-78.1.1, wheel-0.45.1, ca-certificates-2025.2.25, _openmp_mutex-5.1, and _libgcc_mutex-0.1 all reached 100%)
2025-05-07T20:23:55.0130064Z Preparing transaction: done
2025-05-07T20:23:56.4296371Z Verifying transaction: done
2025-05-07T20:23:58.7442947Z Executing transaction: done
2025-05-07T20:23:58.7951473Z #
2025-05-07T20:23:58.7951736Z # To activate this environment, use
2025-05-07T20:23:58.7952313Z #
2025-05-07T20:23:58.7952527Z #     $ conda activate build_binary
2025-05-07T20:23:58.7952799Z #
2025-05-07T20:23:58.7953011Z # To deactivate an active environment, use
2025-05-07T20:23:58.7953305Z #
2025-05-07T20:23:58.7953497Z #     $ conda deactivate
2025-05-07T20:23:58.9018053Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:23:58.9039846Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:01.8772537Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:01.8773917Z Collecting pip
2025-05-07T20:24:01.8774386Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:01.8775001Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:01.8777786Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 102.2 MB/s eta 0:00:00
2025-05-07T20:24:01.8778162Z Installing collected packages: pip
2025-05-07T20:24:01.8778485Z   Attempting uninstall: pip
2025-05-07T20:24:01.8778771Z     Found existing installation: pip 25.1
2025-05-07T20:24:01.8779088Z     Uninstalling pip-25.1:
2025-05-07T20:24:01.8779370Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:01.8779680Z Successfully installed pip-25.1.1
2025-05-07T20:24:01.9427662Z [SETUP] Upgrading pyOpenSSL ...
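One detail worth noting about the command that follows: the spec pyOpenSSL>22.1.0 contains a shell redirection character, so in a script it has to be quoted; otherwise the shell would parse >22.1.0 as output redirection and create a file named 22.1.0. A minimal sketch, assuming the same environment and channels as the log:

    conda install -n build_binary -c conda-forge --override-channels -y 'pyOpenSSL>22.1.0'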
2025-05-07T20:24:01.9450589Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:02.7967786Z Channels:
2025-05-07T20:24:02.7968050Z  - conda-forge
2025-05-07T20:24:02.7968303Z Platform: linux-64
2025-05-07T20:24:13.2466161Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:14.9612698Z Solving environment: done
2025-05-07T20:24:15.0227702Z ## Package Plan ##
2025-05-07T20:24:15.0228196Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:15.0228640Z   added / updated specs:
2025-05-07T20:24:15.0228917Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:15.0229239Z The following packages will be downloaded:
2025-05-07T20:24:15.0229581Z     package                    |            build
2025-05-07T20:24:15.0229906Z     ---------------------------|-----------------
2025-05-07T20:24:15.0230317Z     cffi-1.17.1                |  py312h06ac9bb_0         288 KB  conda-forge
2025-05-07T20:24:15.0230887Z     cryptography-44.0.3        |  py312hda17c39_0         1.5 MB  conda-forge
2025-05-07T20:24:15.0231511Z     expat-2.7.0                |       h5888daf_0         137 KB  conda-forge
2025-05-07T20:24:15.0232061Z     libexpat-2.7.0             |       h5888daf_0          73 KB  conda-forge
2025-05-07T20:24:15.0232628Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:15.0233071Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:15.0233488Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:15.0233901Z     libnsl-2.0.1               |       hd590300_0          33 KB  conda-forge
2025-05-07T20:24:15.0234319Z     libsqlite-3.46.0           |       hde9e2c9_0         845 KB  conda-forge
2025-05-07T20:24:15.0234743Z     libuuid-2.38.1             |       h0b41bf4_0          33 KB  conda-forge
2025-05-07T20:24:15.0235194Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:15.0235643Z     libzlib-1.2.13             |       h4ab18f5_6          60 KB  conda-forge
2025-05-07T20:24:15.0236066Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:15.0236491Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:15.0237278Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:15.0237724Z     python-3.12.2              |hab00c5b_0_cpython        30.8 MB  conda-forge
2025-05-07T20:24:15.0238154Z     python_abi-3.12            |          7_cp312           7 KB  conda-forge
2025-05-07T20:24:15.0238611Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:15.0239095Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:15.0239723Z     zlib-1.2.13                |       h4ab18f5_6          91 KB  conda-forge
2025-05-07T20:24:15.0240277Z     ------------------------------------------------------------
2025-05-07T20:24:15.0240737Z                                            Total:        38.6 MB
2025-05-07T20:24:15.0241187Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:15.0241613Z   cffi               conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:24:15.0242125Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:24:15.0242637Z   libexpat           conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:24:15.0243080Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:15.0243512Z   libnsl             conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:24:15.0245585Z   libsqlite          conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:24:15.0246131Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:15.0246646Z   libzlib            conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:15.0247160Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:15.0247698Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:15.0248216Z   python_abi         conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:24:15.0248799Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:15.0249490Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:15.0250153Z The following packages will be UPDATED:
2025-05-07T20:24:15.0250987Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:15.0251773Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:15.0252548Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:15.0253186Z   libuuid            pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:24:15.0253817Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:15.0254425Z   zlib               pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:15.0254990Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:15.0255607Z   expat              pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:24:15.0256234Z   python             pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:24:15.0256775Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:15.0257165Z   (progress bars elided: openssl-3.5.0, cryptography-44.0.3, libsqlite-3.46.0, libgcc-15.1.0, libgomp-15.1.0, cffi-1.17.1, expat-2.7.0, pyopenssl-25.0.0, pycparser-2.22, libxcrypt-4.4.36, zlib-1.2.13, typing-extensions-4.13.2, libexpat-2.7.0, libzlib-1.2.13, typing_extensions-4.13.2, libuuid-2.38.1, libgcc-ng-15.1.0, and libnsl-2.0.1 all reached 100%; python-3.12.2 last reported at 81% and still downloading)
2025-05-07T20:24:15.0959496Z   ... (more hidden) ...
2025-05-07T20:24:15.9416884Z 2025-05-07T20:24:15.9416888Z 2025-05-07T20:24:15.9416891Z 2025-05-07T20:24:15.9416895Z 2025-05-07T20:24:15.9416899Z 2025-05-07T20:24:15.9416902Z 2025-05-07T20:24:15.9416906Z 2025-05-07T20:24:15.9416909Z 2025-05-07T20:24:15.9416913Z 2025-05-07T20:24:15.9416917Z 2025-05-07T20:24:15.9416926Z 2025-05-07T20:24:15.9416930Z 2025-05-07T20:24:15.9416933Z 2025-05-07T20:24:15.9416937Z 2025-05-07T20:24:15.9442053Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:24:15.9555925Z python-3.12.2 | 30.8 MB | #########4 | 95% 2025-05-07T20:24:15.9556270Z 2025-05-07T20:24:15.9556277Z 2025-05-07T20:24:15.9556282Z 2025-05-07T20:24:15.9556286Z 2025-05-07T20:24:15.9556309Z 2025-05-07T20:24:15.9556321Z 2025-05-07T20:24:15.9556325Z 2025-05-07T20:24:15.9556328Z 2025-05-07T20:24:15.9556332Z 2025-05-07T20:24:15.9556347Z 2025-05-07T20:24:15.9556351Z 2025-05-07T20:24:15.9556355Z 2025-05-07T20:24:15.9556358Z 2025-05-07T20:24:15.9556362Z 2025-05-07T20:24:15.9556536Z 2025-05-07T20:24:15.9556542Z 2025-05-07T20:24:15.9556545Z 2025-05-07T20:24:15.9556550Z 2025-05-07T20:24:15.9556730Z 2025-05-07T20:24:15.9900650Z ... (more hidden) ... 2025-05-07T20:24:15.9900967Z 2025-05-07T20:24:16.0032425Z openssl-3.5.0 | 3.0 MB | ########## | 100%  2025-05-07T20:24:16.6906565Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:16.6913035Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:16.6913313Z 2025-05-07T20:24:16.6913320Z 2025-05-07T20:24:16.6913324Z 2025-05-07T20:24:16.6913338Z 2025-05-07T20:24:16.6913343Z 2025-05-07T20:24:16.6913348Z 2025-05-07T20:24:16.6913352Z 2025-05-07T20:24:16.6913357Z 2025-05-07T20:24:16.6913361Z 2025-05-07T20:24:16.6913366Z 2025-05-07T20:24:16.6913370Z 2025-05-07T20:24:16.6913386Z 2025-05-07T20:24:16.6913390Z 2025-05-07T20:24:16.6913393Z 2025-05-07T20:24:16.6913397Z 2025-05-07T20:24:16.6913400Z 2025-05-07T20:24:16.6913404Z 2025-05-07T20:24:16.6913407Z 2025-05-07T20:24:16.6913411Z 2025-05-07T20:24:16.6913504Z 2025-05-07T20:24:16.6914003Z  2025-05-07T20:24:16.6914332Z 2025-05-07T20:24:16.6914527Z 2025-05-07T20:24:16.6914698Z  2025-05-07T20:24:16.6914896Z 2025-05-07T20:24:16.6914900Z 2025-05-07T20:24:16.6915071Z  2025-05-07T20:24:16.6915285Z 2025-05-07T20:24:16.6915289Z 2025-05-07T20:24:16.6915316Z 2025-05-07T20:24:16.6915488Z  2025-05-07T20:24:16.6915707Z 2025-05-07T20:24:16.6915712Z 2025-05-07T20:24:16.6915716Z 2025-05-07T20:24:16.6915721Z 2025-05-07T20:24:16.6915926Z  2025-05-07T20:24:16.6916150Z 2025-05-07T20:24:16.6916154Z 2025-05-07T20:24:16.6916158Z 2025-05-07T20:24:16.6916161Z 2025-05-07T20:24:16.6916165Z 2025-05-07T20:24:16.6916338Z  2025-05-07T20:24:16.6916557Z 2025-05-07T20:24:16.6916560Z 2025-05-07T20:24:16.6916564Z 2025-05-07T20:24:16.6916568Z 2025-05-07T20:24:16.6916571Z 2025-05-07T20:24:16.6916575Z 2025-05-07T20:24:16.6916752Z  2025-05-07T20:24:16.6916977Z 2025-05-07T20:24:16.6916981Z 2025-05-07T20:24:16.6916984Z 2025-05-07T20:24:16.6916988Z 2025-05-07T20:24:16.6916991Z 2025-05-07T20:24:16.6916995Z 2025-05-07T20:24:16.6916999Z 2025-05-07T20:24:16.6917175Z  2025-05-07T20:24:16.6917407Z 2025-05-07T20:24:16.6917411Z 2025-05-07T20:24:16.6917414Z 2025-05-07T20:24:16.6917620Z 2025-05-07T20:24:16.6917624Z 2025-05-07T20:24:16.6917628Z 2025-05-07T20:24:16.6917631Z 2025-05-07T20:24:16.6917635Z 2025-05-07T20:24:16.6917819Z  2025-05-07T20:24:16.6918045Z 2025-05-07T20:24:16.6918049Z 2025-05-07T20:24:16.6918052Z 2025-05-07T20:24:16.6918056Z 2025-05-07T20:24:16.6918060Z 2025-05-07T20:24:16.6918070Z 2025-05-07T20:24:16.6918074Z 2025-05-07T20:24:16.6918077Z 
2025-05-07T20:24:16.6918081Z 2025-05-07T20:24:16.6918423Z  2025-05-07T20:24:16.6918641Z 2025-05-07T20:24:16.6918644Z 2025-05-07T20:24:16.6918648Z 2025-05-07T20:24:16.6918657Z 2025-05-07T20:24:16.6918661Z 2025-05-07T20:24:16.6918664Z 2025-05-07T20:24:16.6918668Z 2025-05-07T20:24:16.6918671Z 2025-05-07T20:24:16.6918675Z 2025-05-07T20:24:16.6918678Z 2025-05-07T20:24:16.6918868Z  2025-05-07T20:24:16.6919099Z 2025-05-07T20:24:16.6919103Z 2025-05-07T20:24:16.6919107Z 2025-05-07T20:24:16.6919110Z 2025-05-07T20:24:16.6919114Z 2025-05-07T20:24:16.6919117Z 2025-05-07T20:24:16.6919121Z 2025-05-07T20:24:16.6919125Z 2025-05-07T20:24:16.6919185Z 2025-05-07T20:24:16.6919189Z 2025-05-07T20:24:16.6919192Z 2025-05-07T20:24:16.6919388Z  2025-05-07T20:24:16.6919608Z 2025-05-07T20:24:16.6919612Z 2025-05-07T20:24:16.6919616Z 2025-05-07T20:24:16.6919619Z 2025-05-07T20:24:16.6919623Z 2025-05-07T20:24:16.6919632Z 2025-05-07T20:24:16.6919636Z 2025-05-07T20:24:16.6919639Z 2025-05-07T20:24:16.6919643Z 2025-05-07T20:24:16.6919646Z 2025-05-07T20:24:16.6919650Z 2025-05-07T20:24:16.6919653Z 2025-05-07T20:24:16.6919852Z  2025-05-07T20:24:16.6920076Z 2025-05-07T20:24:16.6920079Z 2025-05-07T20:24:16.6920088Z 2025-05-07T20:24:16.6920092Z 2025-05-07T20:24:16.6920095Z 2025-05-07T20:24:16.6920099Z 2025-05-07T20:24:16.6920108Z 2025-05-07T20:24:16.6920111Z 2025-05-07T20:24:16.6920115Z 2025-05-07T20:24:16.6920119Z 2025-05-07T20:24:16.6920122Z 2025-05-07T20:24:16.6920126Z 2025-05-07T20:24:16.6920129Z 2025-05-07T20:24:16.6920325Z  2025-05-07T20:24:16.6920554Z 2025-05-07T20:24:16.6920557Z 2025-05-07T20:24:16.6920561Z 2025-05-07T20:24:16.6920565Z 2025-05-07T20:24:16.6920568Z 2025-05-07T20:24:16.6920572Z 2025-05-07T20:24:16.6920580Z 2025-05-07T20:24:16.6920584Z 2025-05-07T20:24:16.6920588Z 2025-05-07T20:24:16.6920591Z 2025-05-07T20:24:16.6920595Z 2025-05-07T20:24:16.6920599Z 2025-05-07T20:24:16.6920602Z 2025-05-07T20:24:16.6920606Z 2025-05-07T20:24:16.6920810Z  2025-05-07T20:24:16.6921042Z 2025-05-07T20:24:16.6921050Z 2025-05-07T20:24:16.6921054Z 2025-05-07T20:24:16.6921057Z 2025-05-07T20:24:16.6921061Z 2025-05-07T20:24:16.6921064Z 2025-05-07T20:24:16.6921068Z 2025-05-07T20:24:16.6921071Z 2025-05-07T20:24:16.6921075Z 2025-05-07T20:24:16.6921079Z 2025-05-07T20:24:16.6921082Z 2025-05-07T20:24:16.6921086Z 2025-05-07T20:24:16.6921089Z 2025-05-07T20:24:16.6921093Z 2025-05-07T20:24:16.6921096Z 2025-05-07T20:24:16.6921308Z  2025-05-07T20:24:16.6921536Z 2025-05-07T20:24:16.6921540Z 2025-05-07T20:24:16.6921544Z 2025-05-07T20:24:16.6921551Z 2025-05-07T20:24:16.6921555Z 2025-05-07T20:24:16.6921559Z 2025-05-07T20:24:16.6921563Z 2025-05-07T20:24:16.6921572Z 2025-05-07T20:24:16.6921576Z 2025-05-07T20:24:16.6921579Z 2025-05-07T20:24:16.6921583Z 2025-05-07T20:24:16.6921586Z 2025-05-07T20:24:16.6921590Z 2025-05-07T20:24:16.6921593Z 2025-05-07T20:24:16.6921597Z 2025-05-07T20:24:16.6921600Z 2025-05-07T20:24:16.6921893Z  2025-05-07T20:24:16.6922131Z 2025-05-07T20:24:16.6922135Z 2025-05-07T20:24:16.6922139Z 2025-05-07T20:24:16.6922142Z 2025-05-07T20:24:16.6922146Z 2025-05-07T20:24:16.6922149Z 2025-05-07T20:24:16.6922153Z 2025-05-07T20:24:16.6922156Z 2025-05-07T20:24:16.6922160Z 2025-05-07T20:24:16.6922163Z 2025-05-07T20:24:16.6922167Z 2025-05-07T20:24:16.6922170Z 2025-05-07T20:24:16.6922174Z 2025-05-07T20:24:16.6922177Z 2025-05-07T20:24:16.6922181Z 2025-05-07T20:24:16.6922184Z 2025-05-07T20:24:16.6922274Z 2025-05-07T20:24:16.6922497Z  2025-05-07T20:24:16.6922728Z 2025-05-07T20:24:16.6922732Z 2025-05-07T20:24:16.6922735Z 2025-05-07T20:24:16.6922739Z 
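Conda redraws its per-package download meters with terminal control sequences, which a non-interactive CI capture renders as repeated timestamped fragments. Conda's documented -q/--quiet flag suppresses the meters; a minimal sketch in this job's own install-command style (the package named here is purely illustrative):

# Same invocation pattern used throughout this job, with -q added so the
# CI log records one summary line per package instead of meter redraws.
conda install -n build_binary -c conda-forge --override-channels -y -q libxcrypt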
2025-05-07T20:24:19.4967480Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:21.2406096Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:21.2419541Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:21.2442573Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:22.1089260Z Channels:
2025-05-07T20:24:22.1089517Z  - conda-forge
2025-05-07T20:24:22.1089834Z Platform: linux-64
2025-05-07T20:24:25.4819729Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:25.8526875Z Solving environment: done
2025-05-07T20:24:25.8895574Z # All requested packages already installed.
2025-05-07T20:24:29.2395078Z [SETUP] Copying over ...
2025-05-07T20:24:29.2395929Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h
2025-05-07T20:24:30.8662996Z [SETUP] Installed Python version: Python 3.12.2
2025-05-07T20:24:30.8663462Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:30.8706775Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:30.8707420Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:30.8719215Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:30.8719566Z env:
2025-05-07T20:24:30.8719796Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:30.8720104Z   BUILD_ENV: build_binary
2025-05-07T20:24:30.8720391Z   BUILD_TARGET: genai
2025-05-07T20:24:30.8720626Z   BUILD_VARIANT: cuda
2025-05-07T20:24:30.8720864Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:30.8721118Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:30.8721425Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:30.8721760Z ##[endgroup]
2025-05-07T20:24:31.2085589Z ################################################################################
2025-05-07T20:24:31.2085966Z # Install C/C++ Compilers
2025-05-07T20:24:31.2086209Z #
2025-05-07T20:24:31.2101756Z # [2025-05-07T20:24:31.209Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:31.2102290Z ################################################################################
2025-05-07T20:24:31.2118918Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:31.3002005Z [CHECK] Network does not appear to be blocked.
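The [EXEC] [ATTEMPT 0/3] prefix is emitted by a retry helper sourced from the prelude script (.github/scripts/setup_env.bash); the helper's own source is not part of this log. What follows is only a hedged sketch of the pattern it appears to implement (the function name exec_with_retries, the backoff, and the attempt limit are assumptions, not the script's actual code):

# Hypothetical sketch of an '[EXEC] [ATTEMPT i/N]' retry wrapper; not the
# actual setup_env.bash implementation, which this log does not include.
exec_with_retries () {
  local max_attempts=3
  local i
  for ((i = 0; i <= max_attempts; i++)); do
    echo "[EXEC] [ATTEMPT ${i}/${max_attempts}] + $*"
    "$@" && return 0      # stop on the first successful attempt
    sleep $((2 ** i))     # assumed backoff between attempts
  done
  echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Usage mirroring the network probe above:
#   exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null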
2025-05-07T20:24:31.3014718Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:31.3035696Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:32.1668566Z Channels:
2025-05-07T20:24:32.1668875Z  - conda-forge
2025-05-07T20:24:32.1669100Z Platform: linux-64
2025-05-07T20:24:35.4257018Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:35.7901087Z Solving environment: done
2025-05-07T20:24:35.8540937Z ## Package Plan ##
2025-05-07T20:24:35.8541367Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:35.8541792Z   added / updated specs:
2025-05-07T20:24:35.8542062Z     - sysroot_linux-64=2.17
2025-05-07T20:24:35.8542366Z The following packages will be downloaded:
2025-05-07T20:24:35.8542708Z     package                        |            build
2025-05-07T20:24:35.8543028Z     -------------------------------|-----------------
2025-05-07T20:24:35.8543460Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:35.8544063Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:35.8544475Z     ------------------------------------------------------------
2025-05-07T20:24:35.8544825Z                                                Total:        15.4 MB
2025-05-07T20:24:35.8545169Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:35.8545693Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:35.8546262Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:35.8546737Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:36.0526475Z kernel-headers_linux | 921 KB  | ########## | 100%
2025-05-07T20:24:36.3231605Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:24:36.7291862Z done
2025-05-07T20:24:36.8295396Z Preparing transaction: done
2025-05-07T20:24:37.0309755Z Verifying transaction: done
2025-05-07T20:24:37.2361275Z Executing transaction: done
2025-05-07T20:24:37.3906414Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:37.3906806Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:39.0610808Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
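The three [CHECK] lines above appear to verify that nothing on LD_LIBRARY_PATH or from a stale CONDA_PREFIX will shadow the environment's own libstdc++.so.6, which ships as a symbolic link. An equivalent manual spot-check, offered as a hedged sketch (the environment path is copied from this log; the commands themselves are illustrative, not the job's actual check):

# Spot-check the conda-provided libstdc++; the prefix is taken from the log.
env_prefix=/home/ec2-user/miniconda/envs/build_binary
ls -l "${env_prefix}/lib/libstdc++.so.6"         # expect a symbolic link
readlink -f "${env_prefix}/lib/libstdc++.so.6"   # the versioned .so it resolves to
echo "LD_LIBRARY_PATH = ${LD_LIBRARY_PATH:-<unset>}"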
2025-05-07T20:24:39.0623218Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:39.0646439Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:39.9518700Z Channels:
2025-05-07T20:24:39.9518946Z  - conda-forge
2025-05-07T20:24:39.9519168Z Platform: linux-64
2025-05-07T20:24:43.2247185Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:44.1841890Z Solving environment: done
2025-05-07T20:24:44.2494764Z ## Package Plan ##
2025-05-07T20:24:44.2495178Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:44.2495592Z   added / updated specs:
2025-05-07T20:24:44.2495859Z     - gxx_linux-64=11.4.0
2025-05-07T20:24:44.2496154Z The following packages will be downloaded:
2025-05-07T20:24:44.2496522Z     package                         |            build
2025-05-07T20:24:44.2496968Z     --------------------------------|-----------------
2025-05-07T20:24:44.2497543Z     binutils_impl_linux-64-2.40     |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:24:44.2498215Z     binutils_linux-64-2.40          |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:24:44.2498859Z     gcc_impl_linux-64-11.4.0        |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:24:44.2499436Z     gcc_linux-64-11.4.0             |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:24:44.2500015Z     gxx_impl_linux-64-11.4.0        |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:24:44.2500471Z     gxx_linux-64-11.4.0             |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:24:44.2500908Z     ld_impl_linux-64-2.40           |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:24:44.2501391Z     libgcc-devel_linux-64-11.4.0    |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:24:44.2501876Z     libsanitizer-11.4.0             |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:24:44.2502334Z     libstdcxx-15.1.0                |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:24:44.2503200Z     libstdcxx-devel_linux-64-11.4.0 |     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:24:44.2503862Z     libstdcxx-ng-15.1.0             |       h4852527_2          34 KB  conda-forge
2025-05-07T20:24:44.2504274Z     ------------------------------------------------------------
2025-05-07T20:24:44.2504619Z                                                Total:        91.6 MB
2025-05-07T20:24:44.2504962Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:44.2505459Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:44.2506448Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:44.2507582Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:44.2508126Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:44.2508639Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:44.2509149Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:44.2509683Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:44.2510271Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:44.2510775Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:44.2511323Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:44.2511805Z The following packages will be UPDATED:
2025-05-07T20:24:44.2512342Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:44.2513236Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:44.2513825Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:45.2814402Z libstdcxx-15.1.0     | 3.7 MB  | ########## | 100%
2025-05-07T20:24:45.3545820Z libgcc-devel_linux-6 | 2.3 MB  | ########## | 100%
2025-05-07T20:24:45.3962789Z ld_impl_linux-64-2.4 | 691 KB  | ########## | 100%
2025-05-07T20:24:45.4963439Z libstdcxx-ng-15.1.0  | 34 KB   | ########## | 100%
2025-05-07T20:24:45.5662402Z gcc_linux-64-11.4.0  | 31 KB   | ########## | 100%
2025-05-07T20:24:45.5934199Z gxx_linux-64-11.4.0  | 29 KB   | ########## | 100%
2025-05-07T20:24:45.5964291Z libsanitizer-11.4.0  | 3.5 MB  | ########## | 100%
2025-05-07T20:24:45.6966859Z binutils_linux-64-2. | 28 KB   | ########## | 100%
2025-05-07T20:24:45.7968312Z binutils_impl_linux- | 6.0 MB  | ########## | 100%
2025-05-07T20:24:46.3130933Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:24:46.7440240Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:24:46.7446656Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
2025-05-07T20:24:46.7453278Z done
2025-05-07T20:24:46.8458358Z Preparing transaction: done
2025-05-07T20:24:47.0463660Z Verifying transaction: done
2025-05-07T20:24:47.1473188Z Executing transaction: done
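conda-forge's compiler packages ship their binaries under target-triple names such as x86_64-conda-linux-gnu-cc and x86_64-conda-linux-gnu-c++ rather than the conventional cc/gcc/c++/g++, which is why the step below creates symlinks for the plain names. A quick illustrative way to see the prefixed binaries (this ls | grep is an assumption, not part of the job):

# List the triple-prefixed toolchain binaries installed by gxx_linux-64;
# plain cc/gcc/c++/g++ names only exist once the symlink step below runs.
ls /home/ec2-user/miniconda/envs/build_binary/bin/ | grep -- '-conda-linux-gnu-'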
2025-05-07T20:24:47.3127374Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:51.2195116Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:51.2228317Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:51.2257790Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:51.2289168Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:53.1083347Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:53.1719176Z [CHECK] Binary cc found in PATH
2025-05-07T20:24:55.0551460Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:55.1175145Z [CHECK] Binary gcc found in PATH
2025-05-07T20:24:56.9989003Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.0609132Z [CHECK] Binary c++ found in PATH
2025-05-07T20:24:58.9388232Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.0015608Z [CHECK] Binary g++ found in PATH
2025-05-07T20:24:59.0019527Z [INFO] Printing out all preprocessor defines in the C compiler ...
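The dump that follows lists every macro the C compiler predefines. 'cc -dM -E -' is standard GCC usage: -E stops after preprocessing, -dM prints the macro table instead of preprocessed source, and the trailing '-' reads the (empty) program from stdin. A minimal way to reproduce a sorted excerpt inside the job's environment (the sort | head tail is an illustrative addition):

# Reproduce the macro dump below from inside the build_binary environment.
conda run -n build_binary cc -dM -E - < /dev/null | sort | head -n 20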
2025-05-07T20:24:59.0020146Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:24:59.0020377Z 2025-05-07T20:25:00.8953193Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:00.8953724Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:00.8954631Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:00.8954913Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:00.8955261Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:00.8955616Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:00.8955901Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:00.8956203Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:00.8956467Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:00.8956717Z #define __CHAR_BIT__ 8 2025-05-07T20:25:00.8956951Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:00.8957193Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:00.8957443Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:00.8957712Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:00.8957981Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:00.8958286Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.8958588Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:00.8958870Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:00.8959204Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:00.8959705Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:00.8960107Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:00.8960525Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:00.8960835Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:00.8961116Z #define __GCC_IEC_559 2 2025-05-07T20:25:00.8961354Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:00.8961627Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:00.8961892Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:00.8962164Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:00.8962493Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.8962817Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:00.8963077Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:00.8963355Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:00.8963628Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:00.8963884Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:00.8964144Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:00.8964415Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:00.8964669Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:00.8964920Z #define __INT8_C(c) c 2025-05-07T20:25:00.8965159Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:00.8965459Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.8965776Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:00.8966094Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:00.8966450Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:00.8966721Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:00.8966987Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.8967262Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:00.8967534Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:00.8967928Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:00.8968346Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:00.8968629Z #define __linux 1 2025-05-07T20:25:00.8968888Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:00.8969188Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:00.8969468Z #define __unix 1 2025-05-07T20:25:00.8969687Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:00.8969965Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:00.8970247Z #define __WINT_MIN__ 0U 2025-05-07T20:25:00.8970494Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:00.8970781Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:00.8971045Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:00.8971315Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:00.8971567Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:00.8981291Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:00.8981610Z #define __INT64_C(c) c ## L 2025-05-07T20:25:00.8981893Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:00.8982356Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:00.8982633Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:00.8983004Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:00.8983388Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:00.8983640Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:00.8983902Z #define __DBL_DIG__ 15 2025-05-07T20:25:00.8984140Z #define __FLT32_DIG__ 6 2025-05-07T20:25:00.8984443Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:00.8984800Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:00.8985055Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:00.8985379Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:00.8985731Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:00.8985983Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:00.8986246Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:00.8986627Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:00.8987043Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:00.8987325Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:00.8987677Z #define __unix__ 1 2025-05-07T20:25:00.8987905Z #define __INT_WIDTH__ 32 2025-05-07T20:25:00.8988150Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:00.8988392Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:00.8988651Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:00.8988918Z #define __UINT16_C(c) c 2025-05-07T20:25:00.8989154Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:00.8989414Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:00.8989778Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:00.8990150Z #define __gnu_linux__ 1 2025-05-07T20:25:00.8990391Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:00.8990673Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:00.8990958Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.8991228Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:00.8991493Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:00.8991749Z #define __GNUC__ 11 2025-05-07T20:25:00.8991965Z #define __pie__ 2 2025-05-07T20:25:00.8992184Z #define __MMX__ 1 2025-05-07T20:25:00.8992408Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:00.8992668Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:00.8992946Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:00.8993219Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:00.8993560Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:00.8993965Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.8994288Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:00.8994556Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:00.8994814Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:00.8995114Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:00.8995384Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:00.8995642Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:00.8995930Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:00.8996233Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:00.8996495Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:00.8996780Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:00.8997034Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:00.8997293Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:00.8997564Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:00.8997830Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:00.8998081Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:00.8998401Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:00.8998764Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:00.8999033Z #define __SSE2_MATH__ 1 2025-05-07T20:25:00.8999275Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:00.8999577Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.8999866Z #define __amd64 1 2025-05-07T20:25:00.9000085Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:00.9000354Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:00.9000757Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:00.9001068Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:00.9001326Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:00.9001610Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:00.9001861Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:00.9002127Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:00.9002391Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:00.9002649Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:00.9002918Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:00.9003197Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:00.9003444Z #define __x86_64 1 2025-05-07T20:25:00.9003758Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:00.9004124Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:00.9004592Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:00.9005055Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:00.9005534Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:00.9005920Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:00.9006629Z #define __LP64__ 1 2025-05-07T20:25:00.9006895Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9007246Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:00.9007632Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:00.9007912Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:00.9008184Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:00.9008469Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:00.9008749Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:00.9009023Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:00.9009279Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:00.9009544Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:00.9009810Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:00.9010135Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:00.9010503Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:00.9010783Z #define __FLT_DIG__ 6 2025-05-07T20:25:00.9011008Z #define __NO_INLINE__ 1 2025-05-07T20:25:00.9011257Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:00.9011584Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:00.9011927Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:00.9012279Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:00.9012541Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:00.9012791Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:00.9013051Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:00.9013307Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:00.9013602Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:00.9013881Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:00.9014146Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:00.9014454Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:00.9014777Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:00.9015044Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:00.9015311Z #define __FLT128_DIG__ 33 2025-05-07T20:25:00.9015544Z #define __INT32_C(c) c 2025-05-07T20:25:00.9015789Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:00.9016068Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:00.9016344Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:00.9016625Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:00.9016943Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:00.9017244Z #define unix 1 2025-05-07T20:25:00.9017475Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:00.9017791Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9018099Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:00.9018404Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:00.9018739Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:00.9018997Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:00.9019255Z #define __ELF__ 1 2025-05-07T20:25:00.9019488Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:00.9020034Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:00.9020314Z #define __FLT_RADIX__ 2 2025-05-07T20:25:00.9020576Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:00.9020939Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:00.9021303Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:00.9021560Z #define __SSE_MATH__ 1 2025-05-07T20:25:00.9021788Z #define __k8 1 2025-05-07T20:25:00.9022080Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:00.9022460Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:00.9022757Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:00.9023062Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:00.9023318Z #define __LDBL_DIG__ 18 2025-05-07T20:25:00.9023561Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:00.9023820Z #define __x86_64__ 1 2025-05-07T20:25:00.9024055Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:00.9024356Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:00.9024701Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9025002Z #define __FLT64_DIG__ 15 2025-05-07T20:25:00.9025433Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9025789Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:00.9026101Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9026371Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:00.9026650Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9026948Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:00.9027309Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:00.9027708Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:00.9027998Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:00.9028332Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:00.9028660Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:00.9029010Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:00.9029292Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:00.9029601Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:00.9029888Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:00.9030122Z #define __SEG_FS 1 2025-05-07T20:25:00.9030352Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:00.9030628Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:00.9030904Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9031185Z #define __SEG_GS 1 2025-05-07T20:25:00.9031500Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:00.9031886Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:00.9032155Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:00.9032444Z #define __INT16_TYPE__ short int 2025-05-07T20:25:00.9032727Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:00.9033017Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:00.9033282Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:00.9033548Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:00.9033809Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:00.9034156Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:00.9034552Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9034837Z #define linux 1 2025-05-07T20:25:00.9035067Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9035348Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:00.9035622Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:00.9035870Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:00.9036131Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:00.9036395Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:00.9036736Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:00.9037149Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:00.9037485Z #define __code_model_small__ 1 2025-05-07T20:25:00.9037752Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:00.9038037Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:00.9038283Z #define __k8__ 1 2025-05-07T20:25:00.9038592Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:00.9038883Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:00.9039186Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:00.9039423Z #define __pic__ 2 2025-05-07T20:25:00.9039670Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9039984Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:00.9040277Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9040600Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:00.9040971Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:00.9041333Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:00.9041595Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:00.9041888Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:00.9042203Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:00.9042447Z #define __linux__ 1 2025-05-07T20:25:00.9042675Z #define __INT64_TYPE__ long int 2025-05-07T20:25:00.9042938Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:00.9043199Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:00.9043473Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:00.9043862Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:00.9044158Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9044484Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:00.9044785Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:00.9045055Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:00.9045344Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:00.9045645Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:00.9045981Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:00.9046337Z #define __SSE__ 1 2025-05-07T20:25:00.9046564Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:00.9046906Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:00.9047246Z #define __amd64__ 1 2025-05-07T20:25:00.9047472Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:00.9047732Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:00.9047996Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:00.9048274Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:00.9048543Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:00.9048820Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:00.9049079Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:00.9049354Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:00.9049624Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:00.9049975Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:00.9050449Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:00.9050809Z #define _LP64 1 2025-05-07T20:25:00.9051021Z #define __UINT8_C(c) c 2025-05-07T20:25:00.9051271Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:00.9051541Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:00.9051807Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:00.9052210Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:00.9052524Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:00.9052887Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:00.9053357Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:00.9053732Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9054030Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9054340Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:00.9054709Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:00.9055080Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:00.9055340Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:00.9055685Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:00.9056056Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:00.9056320Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:00.9056566Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:00.9056823Z #define __FXSR__ 1 2025-05-07T20:25:00.9057226Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:00.9057687Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:00.9058111Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:00.9058421Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:00.9058672Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:00.9059011Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:00.9059370Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:00.9059609Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:00.9059849Z #define __PIC__ 2 2025-05-07T20:25:00.9060099Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:00.9060502Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:00.9060890Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:00.9061225Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:00.9061555Z #define __SSE2__ 1 2025-05-07T20:25:00.9061778Z #define __INT32_TYPE__ int 2025-05-07T20:25:00.9062029Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:00.9062455Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:00.9062788Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:00.9063148Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:00.9063422Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:00.9063688Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:00.9063959Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9064239Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:00.9064482Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:00.9064732Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:00.9065024Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9065327Z #define __PIE__ 2 2025-05-07T20:25:00.9065646Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:00.9066043Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:00.9066397Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:00.9066757Z #define __INT16_C(c) c 2025-05-07T20:25:00.9066990Z #define __STDC__ 1 2025-05-07T20:25:00.9067223Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:00.9067493Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:00.9067748Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:00.9068047Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:00.9068390Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:00.9068723Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:00.9068988Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:00.9069268Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:00.9069527Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:00.9069809Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:00.9070097Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9070364Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:00.9070661Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9071063Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:00.9071440Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:00.9071742Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:00.9072036Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:00.9072280Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:00.9072445Z 2025-05-07T20:25:00.9589588Z 2025-05-07T20:25:00.9590146Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
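[ASIDE] The macro dump above (and the C++ one that follows) can be reproduced outside the workflow with the same compiler flags the script traces. A minimal bash sketch, assuming a conda env named build_binary as in this job; the conda run invocation and the -dM/-E/-x c++ flags are exactly what the traced commands use, while the trailing sort is an addition here purely for readability:

    # Dump all predefined macros of the C++ compiler in the build_binary env.
    # -dM prints the #define directives, -E stops after preprocessing,
    # and "-x c++ -" reads an (empty) C++ source from stdin.
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | sort

    # Same idea filtered to a single macro, e.g. the default C++ standard:
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus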
2025-05-07T20:25:00.9590615Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:00.9590856Z 2025-05-07T20:25:02.8432152Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:02.8432732Z #define __cpp_attributes 200809L 2025-05-07T20:25:02.8433220Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:02.8433694Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:02.8434076Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:02.8434365Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:02.8435043Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:02.8435408Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:02.8435704Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:02.8436011Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:02.8436327Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:02.8436594Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:02.8436841Z #define __CHAR_BIT__ 8 2025-05-07T20:25:02.8437085Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:02.8437336Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:02.8437590Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:02.8437862Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:02.8438139Z #define __cpp_static_assert 201411L 2025-05-07T20:25:02.8438428Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:02.8438723Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8439025Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:02.8439316Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:02.8439647Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:02.8439975Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:02.8440529Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:02.8440941Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:02.8441255Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:02.8441538Z #define __GCC_IEC_559 2 2025-05-07T20:25:02.8441787Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:02.8442057Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:02.8442334Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:02.8442627Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:02.8442916Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:02.8443239Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:02.8443558Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:02.8443889Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8444216Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:02.8444497Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:02.8444769Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:02.8445060Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:02.8445366Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:02.8445638Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:02.8445898Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:02.8446179Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:02.8446517Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:02.8446846Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:02.8447121Z #define __INT8_C(c) c 2025-05-07T20:25:02.8447363Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:02.8447633Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:02.8447959Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8448289Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:02.8448568Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:02.8448854Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:02.8449179Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:02.8449544Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:02.8449824Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:02.8450105Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:02.8450373Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8450647Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:02.8450928Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:02.8460146Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:02.8460569Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:02.8460862Z #define __linux 1 2025-05-07T20:25:02.8461091Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:02.8461368Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:02.8461651Z #define __unix 1 2025-05-07T20:25:02.8461880Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:02.8462170Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:02.8462576Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:02.8462853Z #define __WINT_MIN__ 0U 2025-05-07T20:25:02.8463108Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:02.8463384Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:02.8463662Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:02.8463930Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:02.8464180Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:02.8464462Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:02.8464760Z #define __INT64_C(c) c ## L 2025-05-07T20:25:02.8465017Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:02.8465317Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:02.8465589Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:02.8465887Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:02.8466170Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:02.8466439Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:02.8466802Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:02.8467183Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:02.8467441Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:02.8467812Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:02.8468084Z #define __DBL_DIG__ 15 2025-05-07T20:25:02.8468317Z #define __FLT32_DIG__ 6 2025-05-07T20:25:02.8468627Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:02.8468973Z #define __GXX_WEAK__ 1 2025-05-07T20:25:02.8469212Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:02.8469464Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:02.8469788Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:02.8470139Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:02.8470401Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:02.8470697Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:02.8471016Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:02.8471432Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:02.8471832Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:02.8472107Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:02.8472361Z #define __unix__ 1 2025-05-07T20:25:02.8472578Z #define __INT_WIDTH__ 32 2025-05-07T20:25:02.8472818Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:02.8473052Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:02.8473303Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:02.8473567Z #define __UINT16_C(c) c 2025-05-07T20:25:02.8473793Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:02.8474044Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:02.8474402Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:02.8474755Z #define __gnu_linux__ 1 2025-05-07T20:25:02.8474988Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:02.8475247Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:02.8475514Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:02.8475796Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8476063Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:02.8476325Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:02.8476573Z #define __GNUC__ 11 2025-05-07T20:25:02.8476784Z #define __GXX_RTTI 1 2025-05-07T20:25:02.8477004Z #define __pie__ 2 2025-05-07T20:25:02.8477207Z #define __MMX__ 1 2025-05-07T20:25:02.8477432Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:02.8477695Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:02.8477968Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:02.8478239Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:02.8478485Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:02.8478777Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:02.8479094Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:02.8479474Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:02.8479861Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:02.8480165Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8480482Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:02.8480842Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:02.8481105Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:02.8481418Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:02.8481714Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:02.8481972Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:02.8482233Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:02.8482514Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:02.8482801Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:02.8483070Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:02.8483352Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:02.8483600Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:02.8483863Z #define __cplusplus 201703L 2025-05-07T20:25:02.8484132Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:02.8484409Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:02.8484668Z #define __DEPRECATED 1 2025-05-07T20:25:02.8484922Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:02.8485224Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:02.8485476Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:02.8485796Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:02.8486243Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:02.8486510Z #define __SSE2_MATH__ 1 2025-05-07T20:25:02.8486756Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:02.8487057Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8487346Z #define __amd64 1 2025-05-07T20:25:02.8487571Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:02.8487845Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:02.8488104Z #define __GNUG__ 11 2025-05-07T20:25:02.8488360Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:02.8488673Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:02.8488920Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:02.8489179Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:02.8489452Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:02.8489721Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:02.8490033Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:02.8490327Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:02.8490599Z #define __cpp_hex_float 201603L 2025-05-07T20:25:02.8490861Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:02.8491130Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:02.8491402Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:02.8491664Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:02.8491932Z #define __x86_64 1 2025-05-07T20:25:02.8492275Z #define __cpp_lambdas 200907L 2025-05-07T20:25:02.8492540Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:02.8492917Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:02.8493307Z #define __cpp_template_auto 201606L 2025-05-07T20:25:02.8493661Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:02.8494117Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:02.8494596Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:02.8494988Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:02.8495242Z #define __LP64__ 1 2025-05-07T20:25:02.8495476Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8495830Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:02.8496204Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:02.8496479Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:02.8496761Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:02.8497029Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:02.8497302Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:02.8497563Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:02.8497829Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:02.8498154Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:02.8498519Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:02.8498794Z #define __FLT_DIG__ 6 2025-05-07T20:25:02.8499017Z #define __NO_INLINE__ 1 2025-05-07T20:25:02.8499413Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:02.8499770Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:02.8500121Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:02.8500379Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:02.8500645Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:02.8500895Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:02.8501172Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:02.8501472Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:02.8501719Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:02.8502013Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:02.8502296Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:02.8502563Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:02.8502859Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:02.8503200Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:02.8503489Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:02.8503744Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:02.8504008Z #define __FLT128_DIG__ 33 2025-05-07T20:25:02.8504244Z #define __INT32_C(c) c 2025-05-07T20:25:02.8504563Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:02.8504840Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:02.8505117Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:02.8505388Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:02.8505699Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:02.8506006Z #define unix 1 2025-05-07T20:25:02.8506561Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:02.8506824Z #define __cpp_rtti 199711L 2025-05-07T20:25:02.8507083Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:02.8507394Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8507687Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:02.8507990Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:02.8508313Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:02.8508552Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:02.8508843Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:02.8509118Z #define __ELF__ 1 2025-05-07T20:25:02.8509343Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:02.8509624Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:02.8509899Z #define __FLT_RADIX__ 2 2025-05-07T20:25:02.8510137Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:02.8510490Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:02.8510849Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:02.8511112Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:02.8511383Z #define __k8 1 2025-05-07T20:25:02.8511677Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:02.8512048Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:02.8512331Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:02.8512624Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:02.8512880Z #define __LDBL_DIG__ 18 2025-05-07T20:25:02.8513115Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:02.8513372Z #define __x86_64__ 1 2025-05-07T20:25:02.8513606Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:02.8513901Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:02.8514235Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8514538Z #define __FLT64_DIG__ 15 2025-05-07T20:25:02.8514807Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8515152Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:02.8515469Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8515734Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:02.8516000Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8516293Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:02.8516657Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:02.8517044Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:02.8517332Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:02.8517798Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:02.8518113Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:02.8518435Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:02.8518726Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:02.8518997Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:02.8519298Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:02.8519574Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:02.8519808Z #define __SEG_FS 1 2025-05-07T20:25:02.8520024Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:02.8520295Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:02.8520567Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8520840Z #define __SEG_GS 1 2025-05-07T20:25:02.8521151Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:02.8521529Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:02.8521791Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:02.8522075Z #define __INT16_TYPE__ short int 2025-05-07T20:25:02.8522359Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:02.8522656Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:02.8523104Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:02.8523348Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:02.8523598Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:02.8523939Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:02.8524327Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8524638Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:02.8524960Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:02.8525265Z #define linux 1 2025-05-07T20:25:02.8525489Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8525761Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:02.8526035Z #define __EXCEPTIONS 1 2025-05-07T20:25:02.8526276Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:02.8526575Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:02.8526848Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:02.8527137Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:02.8527492Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:02.8527889Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:02.8528232Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:02.8528562Z #define __code_model_small__ 1 2025-05-07T20:25:02.8528835Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:02.8529139Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:02.8529446Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:02.8529725Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:02.8530018Z #define __k8__ 1 2025-05-07T20:25:02.8530239Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:02.8530523Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:02.8530818Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:02.8531050Z #define __pic__ 2 2025-05-07T20:25:02.8531300Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8531615Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:02.8531880Z #define __cpp_decltype 200707L 2025-05-07T20:25:02.8532239Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8532569Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:02.8532932Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:02.8533293Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:02.8533587Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:02.8533905Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:02.8534194Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:02.8534445Z #define __linux__ 1 2025-05-07T20:25:02.8534670Z #define __INT64_TYPE__ long int 2025-05-07T20:25:02.8534928Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:02.8535190Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:02.8535466Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:02.8535746Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:02.8536066Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:02.8536448Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8536766Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:02.8537034Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:02.8537327Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:02.8537617Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:02.8537951Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:02.8538310Z #define __SSE__ 1 2025-05-07T20:25:02.8538535Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:02.8538868Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:02.8539212Z #define __amd64__ 1 2025-05-07T20:25:02.8539438Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:02.8539685Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:02.8539954Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:02.8540220Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:02.8540487Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:02.8540760Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:02.8541035Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:02.8541380Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:02.8541731Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:02.8542198Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:02.8542551Z #define _LP64 1 2025-05-07T20:25:02.8542766Z #define __UINT8_C(c) c 2025-05-07T20:25:02.8543004Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:02.8543271Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:02.8543536Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:02.8543799Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:02.8544162Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:02.8544628Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:02.8545005Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8545299Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8545612Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:02.8545923Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:02.8546312Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:02.8546683Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:02.8546941Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:02.8547203Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:02.8547545Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:02.8547906Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:02.8548164Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:02.8548412Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:02.8548656Z #define __FXSR__ 1 2025-05-07T20:25:02.8548957Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:02.8549415Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:02.8549819Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:02.8550136Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:02.8550401Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:02.8550709Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:02.8550997Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:02.8551266Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:02.8551630Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:02.8551993Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:02.8552260Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:02.8552506Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:02.8552737Z #define __PIC__ 2 2025-05-07T20:25:02.8552988Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:02.8553387Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:02.8553767Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:02.8554100Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:02.8554536Z #define __cpp_constexpr 201603L 2025-05-07T20:25:02.8554802Z #define __SSE2__ 1 2025-05-07T20:25:02.8555032Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:02.8555322Z #define __INT32_TYPE__ int 2025-05-07T20:25:02.8555569Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:02.8555824Z #define __cpp_exceptions 199711L 2025-05-07T20:25:02.8556096Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:02.8556428Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:02.8556781Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:02.8557052Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:02.8557318Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:02.8557583Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8557856Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:02.8558102Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:02.8558348Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:02.8558638Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:02.8558926Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8559226Z #define __PIE__ 2 2025-05-07T20:25:02.8559545Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:02.8560046Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:02.8560357Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:02.8560696Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:02.8561061Z #define __INT16_C(c) c 2025-05-07T20:25:02.8561285Z #define __STDC__ 1 2025-05-07T20:25:02.8561496Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:02.8561747Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:02.8562014Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:02.8562262Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:02.8562557Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:02.8562902Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:02.8563236Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:02.8563494Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:02.8563784Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:02.8564064Z #define __SSE_MATH__ 1 2025-05-07T20:25:02.8564302Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:02.8564582Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:02.8564887Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:02.8565161Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:02.8565449Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8565719Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:02.8566009Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8566413Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:02.8566791Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:02.8567084Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:02.8567371Z #define _GNU_SOURCE 1 2025-05-07T20:25:02.8567612Z #define __cpp_init_captures 201304L 2025-05-07T20:25:02.8567893Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:02.8568139Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:02.8568301Z 2025-05-07T20:25:02.9055088Z 2025-05-07T20:25:02.9055529Z + conda run -n build_binary c++ --version 2025-05-07T20:25:02.9055759Z 2025-05-07T20:25:04.7865413Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:04.7865855Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:04.7874457Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:04.7875066Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:04.7875457Z
2025-05-07T20:25:04.7875462Z
2025-05-07T20:25:04.8498883Z
2025-05-07T20:25:04.8499395Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:04.8500185Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:04.8500513Z
2025-05-07T20:25:06.8031175Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:06.8033511Z
2025-05-07T20:25:06.8035030Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:06.8036368Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:06.8037007Z
2025-05-07T20:25:08.7536554Z #define __cplusplus 201703L
2025-05-07T20:25:08.7538667Z
2025-05-07T20:25:08.7539584Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:08.7574993Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:08.7575416Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:08.7588427Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:08.7588774Z env:
2025-05-07T20:25:08.7589001Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:08.7589318Z BUILD_ENV: build_binary
2025-05-07T20:25:08.7589569Z BUILD_TARGET: genai
2025-05-07T20:25:08.7589797Z BUILD_VARIANT: cuda
2025-05-07T20:25:08.7590039Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:08.7590300Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:08.7590602Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:08.7590953Z ##[endgroup]
2025-05-07T20:25:09.0968811Z ################################################################################
2025-05-07T20:25:09.0969165Z # Install CUDA
2025-05-07T20:25:09.0969367Z #
2025-05-07T20:25:09.0985783Z # [2025-05-07T20:25:09.098Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:09.0986174Z ################################################################################
2025-05-07T20:25:09.0986386Z
2025-05-07T20:25:09.1002325Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:09.1927702Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:09.1928258Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:09.1933701Z + conda clean --packages --tarball -y
2025-05-07T20:25:09.1933971Z
2025-05-07T20:25:10.0609502Z Will remove 40 (182.7 MB) tarball(s).
2025-05-07T20:25:10.0610076Z Will remove 7 (108.6 MB) package(s).
2025-05-07T20:25:10.1238879Z
2025-05-07T20:25:10.1248507Z + conda clean --all -y
2025-05-07T20:25:10.1248708Z
2025-05-07T20:25:10.7943049Z There are no unused tarball(s) to remove.
2025-05-07T20:25:10.7943467Z Will remove 1 index cache(s).
2025-05-07T20:25:10.7943779Z There are no unused package(s) to remove.
2025-05-07T20:25:10.7944124Z There are no tempfile(s) to remove.
2025-05-07T20:25:10.7944454Z There are no logfile(s) to remove.
2025-05-07T20:25:10.8572425Z
2025-05-07T20:25:10.8586893Z [INSTALL] Installing CUDA 12.8.0 ...
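[ASIDE] The [EXEC] [ATTEMPT n/3] lines that follow come from a retry wrapper in the sourced setup script (.github/scripts/setup_env.bash, the PRELUDE above); its actual name and internals are not shown in this log. A minimal sketch of the pattern, with run_with_retries as a hypothetical stand-in:

    # Hypothetical sketch of a retry wrapper like the one emitting the
    # "[EXEC] [ATTEMPT n/3]" lines; the real helper lives in setup_env.bash.
    run_with_retries () {
      local max_attempts=3
      local attempt
      for attempt in $(seq 0 "${max_attempts}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0
        fi
        sleep $((2 ** attempt))   # simple backoff between attempts
      done
      return 1
    }

    # The CUDA toolkit install as traced below, wrapped in retries:
    run_with_retries conda install --force-reinstall -n build_binary \
      -c conda-forge --override-channels -y cuda=12.8.0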
2025-05-07T20:25:10.8610950Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:25:11.7719407Z Channels: 2025-05-07T20:25:11.7719731Z - conda-forge 2025-05-07T20:25:11.7720027Z Platform: linux-64 2025-05-07T20:25:22.3559659Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:23.4664875Z Solving environment: - \ | / - done 2025-05-07T20:25:23.5430708Z 2025-05-07T20:25:23.5431233Z ## Package Plan ## 2025-05-07T20:25:23.5431427Z 2025-05-07T20:25:23.5431683Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:23.5432014Z 2025-05-07T20:25:23.5432111Z added / updated specs: 2025-05-07T20:25:23.5432362Z - cuda=12.8.0 2025-05-07T20:25:23.5432497Z 2025-05-07T20:25:23.5432513Z 2025-05-07T20:25:23.5432641Z The following packages will be downloaded: 2025-05-07T20:25:23.5432860Z 2025-05-07T20:25:23.5432982Z package | build 2025-05-07T20:25:23.5433419Z ---------------------------|----------------- 2025-05-07T20:25:23.5433928Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:23.5434492Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:23.5434992Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:23.5435559Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:23.5436175Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:23.5436816Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:25:23.5438701Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5439437Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:23.5440088Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:25:23.5440566Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:25:23.5441025Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:23.5441498Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:25:23.5441997Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:25:23.5442500Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:23.5443198Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:25:23.5443720Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:25:23.5444211Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:25:23.5444662Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:25:23.5445126Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:25:23.5445591Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:25:23.5446058Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:23.5446559Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:25:23.5447024Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:25:23.5447469Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5447955Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5448424Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:23.5448868Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:23.5449338Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:25:23.5449825Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:25:23.5450294Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 
MB conda-forge 2025-05-07T20:25:23.5450768Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:23.5451242Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:25:23.5451701Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:25:23.5452283Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:25:23.5452741Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:25:23.5453197Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:25:23.5453643Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:25:23.5454092Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:23.5454557Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:25:23.5455037Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:25:23.5455497Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:25:23.5455945Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:25:23.5456382Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:23.5456841Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:25:23.5457464Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:23.5457941Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:23.5458417Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:25:23.5458884Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:23.5459324Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:23.5459764Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:25:23.5460225Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5460695Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:23.5461248Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:23.5461714Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:23.5462245Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:23.5462770Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:23.5463264Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:23.5463713Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:23.5464182Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:23.5464653Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:23.5465097Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:23.5465504Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:23.5465919Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:25:23.5466318Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:23.5466706Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:23.5467108Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:23.5467514Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:23.5467902Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:23.5468327Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:25:23.5468786Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:25:23.5469238Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:25:23.5469688Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:23.5470147Z libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge 2025-05-07T20:25:23.5470607Z libcufile-dev-1.13.0.11 | h5888daf_0 
35 KB conda-forge 2025-05-07T20:25:23.5471060Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:25:23.5471521Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:25:23.5471984Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:25:23.5472450Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:25:23.5472932Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:25:23.5473410Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:23.5473885Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:23.5474339Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:23.5474792Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:23.5475350Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:23.5475836Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:23.5476254Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:25:23.5476689Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:23.5477118Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:23.5477520Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:23.5477934Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:25:23.5478367Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:25:23.5478882Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:23.5479319Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:25:23.5479789Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:23.5480261Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:25:23.5480730Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:23.5481193Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:25:23.5481650Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:23.5482101Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:25:23.5482514Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:23.5482947Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:23.5483380Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:23.5483818Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:23.5484233Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:23.5484662Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:23.5485108Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:23.5485528Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:23.5485941Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:23.5486341Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:23.5486781Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:25:23.5487235Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:23.5487628Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:23.5488018Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:23.5488458Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:23.5488904Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:23.5489339Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:23.5489784Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:23.5490194Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:23.5490604Z tk-8.6.13 
|noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:23.5491014Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:23.5491417Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:23.5492008Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:23.5492563Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:23.5493030Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:23.5493516Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:23.5493986Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:23.5494450Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:23.5494906Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:23.5495340Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:23.5495888Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:23.5496331Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:23.5496804Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:23.5497299Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:23.5497764Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:23.5498210Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:23.5498666Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:23.5499113Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:23.5499559Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:23.5500024Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:23.5500493Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:23.5500915Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:23.5501301Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:23.5501679Z ------------------------------------------------------------ 2025-05-07T20:25:23.5502026Z Total: 1.88 GB 2025-05-07T20:25:23.5502236Z 2025-05-07T20:25:23.5502378Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:23.5502605Z 2025-05-07T20:25:23.5502809Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:23.5503236Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:23.5503652Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:23.5504121Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:23.5504552Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:25:23.5505038Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:25:23.5505652Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:25:23.5506774Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:25:23.5507336Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:25:23.5507906Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:25:23.5508435Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 2025-05-07T20:25:23.5508969Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:25:23.5509553Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:23.5510180Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:25:23.5510990Z cuda-cudart-stati~ 
conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1
  cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1
  cuda-cuobjdump     conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0
  cuda-cupti         conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0
  cuda-cupti-dev     conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0
  cuda-cuxxfilt      conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0
  cuda-driver-dev    conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1
  cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1
  cuda-gdb           conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0
  cuda-libraries     conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0
  cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0
  cuda-nsight        conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0
  cuda-nvcc          conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0
  cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1
  cuda-nvcc-impl     conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1
  cuda-nvcc-tools    conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1
  cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0
  cuda-nvdisasm      conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0
  cuda-nvml-dev      conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0
  cuda-nvprof        conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0
  cuda-nvprune       conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0
  cuda-nvrtc         conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0
  cuda-nvrtc-dev     conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0
  cuda-nvtx          conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0
  cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1
  cuda-nvvm-impl     conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1
  cuda-nvvm-tools    conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1
  cuda-nvvp          conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0
  cuda-opencl        conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0
  cuda-opencl-dev    conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0
  cuda-profiler-api  conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0
  cuda-runtime       conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0
  cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0
  cuda-toolkit       conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0
  cuda-tools         conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0
  cuda-version       conda-forge/noarch::cuda-version-12.8-h5d125a7_3
  cuda-visual-tools  conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0
  cxx-compiler       conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
  dbus               conda-forge/linux-64::dbus-1.13.6-h5008d03_3
  font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
  font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
  font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
  font-ttf-ubuntu    conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
  fontconfig         conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
  fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
  fonts-conda-forge  conda-forge/noarch::fonts-conda-forge-1-0
  freetype           conda-forge/linux-64::freetype-2.13.3-ha770c72_1
  gcc                conda-forge/linux-64::gcc-11.4.0-h602e360_13
  gds-tools          conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0
  gmp                conda-forge/linux-64::gmp-6.3.0-hac33072_2
  gxx                conda-forge/linux-64::gxx-11.4.0-h602e360_13
  keyutils           conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
  krb5               conda-forge/linux-64::krb5-1.21.3-h659f571_0
  libcap             conda-forge/linux-64::libcap-2.71-h39aace5_0
  libcublas          conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0
  libcublas-dev      conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0
  libcufft           conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0
  libcufft-dev       conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0
  libcufile          conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0
  libcufile-dev      conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0
  libcurand          conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0
  libcurand-dev      conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0
  libcusolver        conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0
  libcusolver-dev    conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0
  libcusparse        conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0
  libcusparse-dev    conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0
  libedit            conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
  libfreetype        conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
  libfreetype6       conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
  libgcrypt-lib      conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
  libglib            conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
  libglvnd           conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2
  libgpg-error       conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
  libiconv           conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
  libnl              conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
  libnpp             conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0
  libnpp-dev         conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0
  libnuma            conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
  libnvfatbin        conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0
  libnvfatbin-dev    conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0
  libnvjitlink       conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0
  libnvjitlink-dev   conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0
  libnvjpeg          conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0
  libnvjpeg-dev      conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0
  libopengl          conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2
  libpng             conda-forge/linux-64::libpng-1.6.47-h943b412_0
  libsystemd0        conda-forge/linux-64::libsystemd0-256.9-h2774228_0
  libudev1           conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
  libxcb             conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
  libxkbcommon       conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
  libxkbfile         conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
  libxml2            conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
  lz4-c              conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
  nsight-compute     conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0
  nspr               conda-forge/linux-64::nspr-4.36-h5888daf_0
  nss                conda-forge/linux-64::nss-3.111-h159eef7_0
  ocl-icd            conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
  opencl-headers     conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
  pcre2              conda-forge/linux-64::pcre2-10.44-hc749103_2
  pthread-stubs      conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
  rdma-core          conda-forge/linux-64::rdma-core-55.0-h5888daf_0
  wayland            conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
  xcb-util           conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
  xcb-util-cursor    conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
  xcb-util-image     conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
  xcb-util-keysyms   conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
  xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
  xcb-util-wm        conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
  xkeyboard-config   conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
  xorg-libice        conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
  xorg-libsm         conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
  xorg-libx11        conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
  xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
  xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
  xorg-libxdamage    conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
  xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
  xorg-libxext       conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
  xorg-libxfixes     conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
  xorg-libxi         conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
  xorg-libxrandr     conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
  xorg-libxrender    conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
  xorg-libxtst       conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
  zstd               conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2

2025-05-07T20:25:23.5587962Z The following packages will be UPDATED:

  libsqlite          3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
  libzlib            1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
  zlib               1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2

2025-05-07T20:25:23.5589728Z The following packages will be SUPERSEDED by a higher-priority channel:

  sqlite             pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
  tk                 pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
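The transaction above is a conda-forge solve for the full CUDA 12.8.0 toolchain plus host compilers (gcc/gxx 11.4.0 via cxx-compiler). The log does not record the command that requested it, so the following is only a minimal sketch of an equivalent request; the environment name fbgemm-build is hypothetical, and the version pins are copied from the transaction:

  # Hedged sketch, not the workflow's actual invocation: request the CUDA 12.8
  # toolchain and host compilers seen in the transaction above from conda-forge.
  # The rest of the listing (libcublas, libcusparse, nsight-compute, the
  # xorg/xcb stack) arrives as dependencies of these top-level specs.
  conda create -n fbgemm-build -y -c conda-forge \
      cuda-toolkit=12.8.0 cuda-nvcc=12.8.61 \
      cxx-compiler=1.5.2 gcc=11.4.0 gxx=11.4.0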
2025-05-07T20:25:23.5591656Z Downloading and Extracting Packages: ...working...
  libcublas-12.8.3.14  | 460.2 MB
  nsight-compute-2025. | 320.6 MB
  libcusparse-12.5.7.5 | 164.9 MB
  libcusolver-11.7.2.5 | 156.9 MB
  libcufft-11.3.3.41   | 147.4 MB
  libnpp-12.3.3.65     | 130.6 MB
  cuda-nsight-12.8.55  | 113.2 MB
  cuda-nvvp-12.8.57    | 112.4 MB
  cuda-nvrtc-12.8.61   | 63.1 MB
  libcurand-10.3.9.55  | 43.6 MB
  gds-tools-1.13.0.11  | 37.9 MB
  libnvjitlink-12.8.61 | 28.7 MB
  cuda-nvcc-tools-12.8 | 24.5 MB
  cuda-nvvm-tools-12.8 | 23.5 MB
  cuda-nvvm-impl-12.8. | 20.8 MB
  cuda-nvcc-dev_linux- | 12.7 MB
  cuda-sanitizer-api-1 | 8.8 MB
  cuda-nvdisasm-12.8.5 | 4.9 MB
  cuda-cupti-dev-12.8. | 4.0 MB
  ... (more hidden) ...
2025-05-07T20:25:23Z - 2025-05-07T20:25:32Z [progress-bar redraws trimmed: libcufft-11.3.3.41, libcusparse-12.5.7.5, and libcusolver-11.7.2.5 reach 100% by 20:25:31; when this excerpt ends at 20:25:32, libcublas-12.8.3.14 is at ~76%, nsight-compute-2025. at ~93%, libnpp-12.3.3.65 at ~63%, cuda-nsight-12.8.55 at ~33%, and cuda-nvvp-12.8.57 at ~24%]
| 320.6 MB | #########4 | 94%  2025-05-07T20:25:32.7770825Z 2025-05-07T20:25:32.7770829Z 2025-05-07T20:25:32.7770833Z 2025-05-07T20:25:32.7770836Z 2025-05-07T20:25:32.7770840Z 2025-05-07T20:25:32.7770843Z 2025-05-07T20:25:32.8260627Z cuda-nsight-12.8.55 | 113.2 MB | ###5 | 35%  2025-05-07T20:25:32.8443901Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:25:32.8444267Z 2025-05-07T20:25:32.8444274Z 2025-05-07T20:25:32.8444279Z 2025-05-07T20:25:32.8444284Z 2025-05-07T20:25:32.8445682Z 2025-05-07T20:25:32.8450016Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 65%  2025-05-07T20:25:32.8450500Z 2025-05-07T20:25:32.8450505Z 2025-05-07T20:25:32.8450510Z 2025-05-07T20:25:32.8450515Z 2025-05-07T20:25:32.8450521Z 2025-05-07T20:25:32.8450525Z 2025-05-07T20:25:32.8457103Z 2025-05-07T20:25:32.8545961Z cuda-nvvp-12.8.57 | 112.4 MB | ##5 | 26%  2025-05-07T20:25:32.8546352Z 2025-05-07T20:25:32.8800352Z nsight-compute-2025. | 320.6 MB | #########4 | 95%  2025-05-07T20:25:32.8800926Z 2025-05-07T20:25:32.8800932Z 2025-05-07T20:25:32.8800938Z 2025-05-07T20:25:32.8800943Z 2025-05-07T20:25:32.8800948Z 2025-05-07T20:25:32.8804516Z 2025-05-07T20:25:32.9267475Z cuda-nsight-12.8.55 | 113.2 MB | ###7 | 37%  2025-05-07T20:25:32.9449673Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:25:32.9449945Z 2025-05-07T20:25:32.9450030Z 2025-05-07T20:25:32.9450034Z 2025-05-07T20:25:32.9450037Z 2025-05-07T20:25:32.9450132Z 2025-05-07T20:25:32.9456641Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 67%  2025-05-07T20:25:32.9457078Z 2025-05-07T20:25:32.9457085Z 2025-05-07T20:25:32.9457090Z 2025-05-07T20:25:32.9457096Z 2025-05-07T20:25:32.9457101Z 2025-05-07T20:25:32.9457106Z 2025-05-07T20:25:32.9457367Z 2025-05-07T20:25:32.9753292Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 28%  2025-05-07T20:25:32.9753681Z 2025-05-07T20:25:32.9913270Z nsight-compute-2025. | 320.6 MB | #########5 | 96%  2025-05-07T20:25:32.9913572Z 2025-05-07T20:25:32.9913576Z 2025-05-07T20:25:32.9913580Z 2025-05-07T20:25:32.9913583Z 2025-05-07T20:25:32.9913587Z 2025-05-07T20:25:32.9919488Z 2025-05-07T20:25:33.0288766Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 40%  2025-05-07T20:25:33.0458177Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:25:33.0458446Z 2025-05-07T20:25:33.0458450Z 2025-05-07T20:25:33.0458463Z 2025-05-07T20:25:33.0458467Z 2025-05-07T20:25:33.0458471Z 2025-05-07T20:25:33.0458783Z 2025-05-07T20:25:33.0464505Z 2025-05-07T20:25:33.0713782Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 30%  2025-05-07T20:25:33.0714129Z 2025-05-07T20:25:33.0714134Z 2025-05-07T20:25:33.0714137Z 2025-05-07T20:25:33.0714165Z 2025-05-07T20:25:33.0714169Z 2025-05-07T20:25:33.0753230Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 69%  2025-05-07T20:25:33.0753586Z 2025-05-07T20:25:33.1031521Z nsight-compute-2025. | 320.6 MB | #########6 | 96%  2025-05-07T20:25:33.1031942Z 2025-05-07T20:25:33.1031949Z 2025-05-07T20:25:33.1031954Z 2025-05-07T20:25:33.1031970Z 2025-05-07T20:25:33.1031975Z 2025-05-07T20:25:33.1031980Z 2025-05-07T20:25:33.1295639Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 42%  2025-05-07T20:25:33.1460458Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 78% 2025-05-07T20:25:33.1460732Z 2025-05-07T20:25:33.1460737Z 2025-05-07T20:25:33.1460740Z 2025-05-07T20:25:33.1460744Z 2025-05-07T20:25:33.1460748Z 2025-05-07T20:25:33.1460779Z 2025-05-07T20:25:33.1462895Z 2025-05-07T20:25:33.1756884Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 32%  2025-05-07T20:25:33.1757185Z 2025-05-07T20:25:33.1774050Z nsight-compute-2025. 
| 320.6 MB | #########7 | 97%  2025-05-07T20:25:33.1774320Z 2025-05-07T20:25:33.1774325Z 2025-05-07T20:25:33.1774329Z 2025-05-07T20:25:33.1774332Z 2025-05-07T20:25:33.1776113Z 2025-05-07T20:25:33.2112498Z libnpp-12.3.3.65 | 130.6 MB | ####### | 71%  2025-05-07T20:25:33.2112866Z 2025-05-07T20:25:33.2112870Z 2025-05-07T20:25:33.2112873Z 2025-05-07T20:25:33.2112877Z 2025-05-07T20:25:33.2112881Z 2025-05-07T20:25:33.2116931Z 2025-05-07T20:25:33.2304570Z cuda-nsight-12.8.55 | 113.2 MB | ####4 | 44%  2025-05-07T20:25:33.2650644Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:25:33.2651044Z 2025-05-07T20:25:33.2651050Z 2025-05-07T20:25:33.2651055Z 2025-05-07T20:25:33.2651060Z 2025-05-07T20:25:33.2651107Z 2025-05-07T20:25:33.2651114Z 2025-05-07T20:25:33.2651119Z 2025-05-07T20:25:33.2762386Z cuda-nvvp-12.8.57 | 112.4 MB | ###3 | 34%  2025-05-07T20:25:33.2764529Z 2025-05-07T20:25:33.2888592Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:25:33.2888873Z 2025-05-07T20:25:33.2888878Z 2025-05-07T20:25:33.2888881Z 2025-05-07T20:25:33.2888885Z 2025-05-07T20:25:33.2891007Z 2025-05-07T20:25:33.3197755Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 73%  2025-05-07T20:25:33.3198107Z 2025-05-07T20:25:33.3198113Z 2025-05-07T20:25:33.3198118Z 2025-05-07T20:25:33.3198122Z 2025-05-07T20:25:33.3198126Z 2025-05-07T20:25:33.3200047Z 2025-05-07T20:25:33.3308860Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 46%  2025-05-07T20:25:33.3651318Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 79% 2025-05-07T20:25:33.3651699Z 2025-05-07T20:25:33.3651706Z 2025-05-07T20:25:33.3651711Z 2025-05-07T20:25:33.3651716Z 2025-05-07T20:25:33.3651752Z 2025-05-07T20:25:33.3651758Z 2025-05-07T20:25:33.3651763Z 2025-05-07T20:25:33.3818812Z cuda-nvvp-12.8.57 | 112.4 MB | ###5 | 36%  2025-05-07T20:25:33.3819126Z 2025-05-07T20:25:33.3977291Z nsight-compute-2025. | 320.6 MB | #########8 | 99%  2025-05-07T20:25:33.3977595Z 2025-05-07T20:25:33.3977599Z 2025-05-07T20:25:33.3977603Z 2025-05-07T20:25:33.3977607Z 2025-05-07T20:25:33.3978319Z 2025-05-07T20:25:33.4289566Z libnpp-12.3.3.65 | 130.6 MB | #######4 | 75%  2025-05-07T20:25:33.4289883Z 2025-05-07T20:25:33.4289889Z 2025-05-07T20:25:33.4289895Z 2025-05-07T20:25:33.4289899Z 2025-05-07T20:25:33.4289902Z 2025-05-07T20:25:33.4291629Z 2025-05-07T20:25:33.4309109Z cuda-nsight-12.8.55 | 113.2 MB | ####8 | 48%  2025-05-07T20:25:33.4652300Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 80% 2025-05-07T20:25:33.4652586Z 2025-05-07T20:25:33.4652596Z 2025-05-07T20:25:33.4652601Z 2025-05-07T20:25:33.4652834Z 2025-05-07T20:25:33.4652838Z 2025-05-07T20:25:33.4652842Z 2025-05-07T20:25:33.4654392Z 2025-05-07T20:25:33.4824512Z cuda-nvvp-12.8.57 | 112.4 MB | ###7 | 38%  2025-05-07T20:25:33.4826106Z 2025-05-07T20:25:33.4979957Z nsight-compute-2025. 
| 320.6 MB | #########9 | 100%  2025-05-07T20:25:33.4980249Z 2025-05-07T20:25:33.4980254Z 2025-05-07T20:25:33.4980257Z 2025-05-07T20:25:33.4980261Z 2025-05-07T20:25:33.4980265Z 2025-05-07T20:25:33.5292957Z libnpp-12.3.3.65 | 130.6 MB | #######6 | 76%  2025-05-07T20:25:33.5293424Z 2025-05-07T20:25:33.5293429Z 2025-05-07T20:25:33.5293432Z 2025-05-07T20:25:33.5293436Z 2025-05-07T20:25:33.5293440Z 2025-05-07T20:25:33.5295505Z 2025-05-07T20:25:33.5313084Z cuda-nsight-12.8.55 | 113.2 MB | ##### | 50%  2025-05-07T20:25:33.5655332Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:25:33.5655740Z 2025-05-07T20:25:33.5655746Z 2025-05-07T20:25:33.5655786Z 2025-05-07T20:25:33.5655791Z 2025-05-07T20:25:33.5655797Z 2025-05-07T20:25:33.5655802Z 2025-05-07T20:25:33.5659166Z 2025-05-07T20:25:33.5983753Z cuda-nvvp-12.8.57 | 112.4 MB | #### | 40%  2025-05-07T20:25:33.5984221Z 2025-05-07T20:25:33.5984229Z 2025-05-07T20:25:33.5984235Z 2025-05-07T20:25:33.5984240Z 2025-05-07T20:25:33.5984245Z 2025-05-07T20:25:33.6296769Z libnpp-12.3.3.65 | 130.6 MB | #######8 | 78%  2025-05-07T20:25:33.6297076Z 2025-05-07T20:25:33.6297082Z 2025-05-07T20:25:33.6297086Z 2025-05-07T20:25:33.6297090Z 2025-05-07T20:25:33.6297094Z 2025-05-07T20:25:33.6298235Z 2025-05-07T20:25:33.6317245Z cuda-nsight-12.8.55 | 113.2 MB | #####2 | 52%  2025-05-07T20:25:33.6655900Z libcublas-12.8.3.14 | 460.2 MB | ######## | 81% 2025-05-07T20:25:33.6656171Z 2025-05-07T20:25:33.6656183Z 2025-05-07T20:25:33.6656187Z 2025-05-07T20:25:33.6656191Z 2025-05-07T20:25:33.6656196Z 2025-05-07T20:25:33.6656199Z 2025-05-07T20:25:33.6656486Z 2025-05-07T20:25:33.7005508Z cuda-nvvp-12.8.57 | 112.4 MB | ####2 | 42%  2025-05-07T20:25:33.7005912Z 2025-05-07T20:25:33.7005918Z 2025-05-07T20:25:33.7005923Z 2025-05-07T20:25:33.7005953Z 2025-05-07T20:25:33.7008401Z 2025-05-07T20:25:33.7320278Z libnpp-12.3.3.65 | 130.6 MB | ######## | 80%  2025-05-07T20:25:33.7321039Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:25:33.7321291Z 2025-05-07T20:25:33.7321295Z 2025-05-07T20:25:33.7321299Z 2025-05-07T20:25:33.7321303Z 2025-05-07T20:25:33.7321307Z 2025-05-07T20:25:33.7321654Z 2025-05-07T20:25:33.7659906Z cuda-nsight-12.8.55 | 113.2 MB | #####4 | 54%  2025-05-07T20:25:33.7660246Z 2025-05-07T20:25:33.7660250Z 2025-05-07T20:25:33.7660254Z 2025-05-07T20:25:33.7660258Z 2025-05-07T20:25:33.7660261Z 2025-05-07T20:25:33.7660266Z 2025-05-07T20:25:33.7660644Z 2025-05-07T20:25:33.8006465Z cuda-nvvp-12.8.57 | 112.4 MB | ####4 | 45%  2025-05-07T20:25:33.8006919Z 2025-05-07T20:25:33.8006926Z 2025-05-07T20:25:33.8006931Z 2025-05-07T20:25:33.8006935Z 2025-05-07T20:25:33.8006947Z 2025-05-07T20:25:33.8324556Z libnpp-12.3.3.65 | 130.6 MB | ########2 | 82%  2025-05-07T20:25:33.8324949Z 2025-05-07T20:25:33.8324955Z 2025-05-07T20:25:33.8324960Z 2025-05-07T20:25:33.8324965Z 2025-05-07T20:25:33.8324970Z 2025-05-07T20:25:33.8324980Z 2025-05-07T20:25:33.8660426Z cuda-nsight-12.8.55 | 113.2 MB | #####7 | 57%  2025-05-07T20:25:33.8660838Z 2025-05-07T20:25:33.8660844Z 2025-05-07T20:25:33.8660849Z 2025-05-07T20:25:33.8660854Z 2025-05-07T20:25:33.8660859Z 2025-05-07T20:25:33.8660864Z 2025-05-07T20:25:33.8663513Z 2025-05-07T20:25:33.9007747Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 47%  2025-05-07T20:25:33.9008150Z 2025-05-07T20:25:33.9008156Z 2025-05-07T20:25:33.9008161Z 2025-05-07T20:25:33.9008166Z 2025-05-07T20:25:33.9010197Z 2025-05-07T20:25:33.9324140Z libnpp-12.3.3.65 | 130.6 MB | ########4 | 84%  2025-05-07T20:25:33.9324533Z 2025-05-07T20:25:33.9324539Z 2025-05-07T20:25:33.9324544Z 
2025-05-07T20:25:33.9324550Z 2025-05-07T20:25:33.9324573Z 2025-05-07T20:25:33.9324587Z 2025-05-07T20:25:33.9523997Z cuda-nsight-12.8.55 | 113.2 MB | #####9 | 60%  2025-05-07T20:25:33.9713349Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:25:33.9713693Z 2025-05-07T20:25:33.9713700Z 2025-05-07T20:25:33.9713705Z 2025-05-07T20:25:33.9713710Z 2025-05-07T20:25:33.9713715Z 2025-05-07T20:25:33.9713720Z 2025-05-07T20:25:33.9715978Z 2025-05-07T20:25:34.0333529Z cuda-nvvp-12.8.57 | 112.4 MB | ####9 | 50%  2025-05-07T20:25:34.0333843Z 2025-05-07T20:25:34.0333847Z 2025-05-07T20:25:34.0333851Z 2025-05-07T20:25:34.0333855Z 2025-05-07T20:25:34.0333859Z 2025-05-07T20:25:34.0339030Z 2025-05-07T20:25:34.0546139Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 62%  2025-05-07T20:25:34.0631495Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:25:34.0631864Z 2025-05-07T20:25:34.0631870Z 2025-05-07T20:25:34.0631875Z 2025-05-07T20:25:34.0631880Z 2025-05-07T20:25:34.0631904Z 2025-05-07T20:25:34.0763623Z libnpp-12.3.3.65 | 130.6 MB | ########5 | 86%  2025-05-07T20:25:34.0764003Z 2025-05-07T20:25:34.0764009Z 2025-05-07T20:25:34.0764014Z 2025-05-07T20:25:34.0764019Z 2025-05-07T20:25:34.0764024Z 2025-05-07T20:25:34.0764029Z 2025-05-07T20:25:34.0765288Z 2025-05-07T20:25:34.1397454Z cuda-nvvp-12.8.57 | 112.4 MB | #####1 | 52%  2025-05-07T20:25:34.1397870Z 2025-05-07T20:25:34.1397876Z 2025-05-07T20:25:34.1397880Z 2025-05-07T20:25:34.1397885Z 2025-05-07T20:25:34.1397890Z 2025-05-07T20:25:34.1398542Z 2025-05-07T20:25:34.1546351Z cuda-nsight-12.8.55 | 113.2 MB | ######4 | 65%  2025-05-07T20:25:34.1635207Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 83% 2025-05-07T20:25:34.1635595Z 2025-05-07T20:25:34.1635601Z 2025-05-07T20:25:34.1635607Z 2025-05-07T20:25:34.1635626Z 2025-05-07T20:25:34.1635635Z 2025-05-07T20:25:34.1763865Z libnpp-12.3.3.65 | 130.6 MB | ########7 | 88%  2025-05-07T20:25:34.1764255Z 2025-05-07T20:25:34.1764261Z 2025-05-07T20:25:34.1764277Z 2025-05-07T20:25:34.1764282Z 2025-05-07T20:25:34.1764287Z 2025-05-07T20:25:34.1764292Z 2025-05-07T20:25:34.1766391Z 2025-05-07T20:25:34.2408906Z cuda-nvvp-12.8.57 | 112.4 MB | #####4 | 54%  2025-05-07T20:25:34.2409321Z 2025-05-07T20:25:34.2409327Z 2025-05-07T20:25:34.2409332Z 2025-05-07T20:25:34.2409337Z 2025-05-07T20:25:34.2409341Z 2025-05-07T20:25:34.2409348Z 2025-05-07T20:25:34.2627092Z cuda-nsight-12.8.55 | 113.2 MB | ######7 | 67%  2025-05-07T20:25:34.2774012Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:25:34.2774371Z 2025-05-07T20:25:34.2774378Z 2025-05-07T20:25:34.2774418Z 2025-05-07T20:25:34.2774424Z 2025-05-07T20:25:34.2774429Z 2025-05-07T20:25:34.2774443Z 2025-05-07T20:25:34.2776170Z 2025-05-07T20:25:34.2882888Z cuda-nvvp-12.8.57 | 112.4 MB | #####6 | 56%  2025-05-07T20:25:34.2883538Z 2025-05-07T20:25:34.2883546Z 2025-05-07T20:25:34.2883561Z 2025-05-07T20:25:34.2883566Z 2025-05-07T20:25:34.2886642Z 2025-05-07T20:25:34.3431394Z libnpp-12.3.3.65 | 130.6 MB | ########9 | 89%  2025-05-07T20:25:34.3431782Z 2025-05-07T20:25:34.3431795Z 2025-05-07T20:25:34.3431799Z 2025-05-07T20:25:34.3431803Z 2025-05-07T20:25:34.3431807Z 2025-05-07T20:25:34.3431841Z 2025-05-07T20:25:34.3635289Z cuda-nsight-12.8.55 | 113.2 MB | ######9 | 70%  2025-05-07T20:25:34.3885191Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:25:34.3885552Z 2025-05-07T20:25:34.3885558Z 2025-05-07T20:25:34.3885564Z 2025-05-07T20:25:34.3885569Z 2025-05-07T20:25:34.3888211Z 2025-05-07T20:25:34.4431793Z libnpp-12.3.3.65 | 130.6 MB | #########1 | 91%  
2025-05-07T20:25:34.4432453Z 2025-05-07T20:25:34.4432461Z 2025-05-07T20:25:34.4432479Z 2025-05-07T20:25:34.4432485Z 2025-05-07T20:25:34.4432489Z 2025-05-07T20:25:34.4436019Z 2025-05-07T20:25:34.4637457Z cuda-nsight-12.8.55 | 113.2 MB | #######2 | 72%  2025-05-07T20:25:34.4889419Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 85% 2025-05-07T20:25:34.4889732Z 2025-05-07T20:25:34.4889737Z 2025-05-07T20:25:34.4889740Z 2025-05-07T20:25:34.4889744Z 2025-05-07T20:25:34.4893122Z 2025-05-07T20:25:34.5011029Z libnpp-12.3.3.65 | 130.6 MB | #########3 | 94%  2025-05-07T20:25:34.5011450Z 2025-05-07T20:25:34.5011456Z 2025-05-07T20:25:34.5011461Z 2025-05-07T20:25:34.5011466Z 2025-05-07T20:25:34.5011472Z 2025-05-07T20:25:34.5011477Z 2025-05-07T20:25:34.5011482Z 2025-05-07T20:25:34.5444914Z cuda-nvvp-12.8.57 | 112.4 MB | #####8 | 59%  2025-05-07T20:25:34.5445233Z 2025-05-07T20:25:34.5445267Z 2025-05-07T20:25:34.5445271Z 2025-05-07T20:25:34.5445274Z 2025-05-07T20:25:34.5445278Z 2025-05-07T20:25:34.5445290Z 2025-05-07T20:25:34.5641489Z cuda-nsight-12.8.55 | 113.2 MB | #######4 | 75%  2025-05-07T20:25:34.5977919Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 85% 2025-05-07T20:25:34.5978315Z 2025-05-07T20:25:34.5978322Z 2025-05-07T20:25:34.5978327Z 2025-05-07T20:25:34.5978332Z 2025-05-07T20:25:34.5980088Z 2025-05-07T20:25:34.6012523Z libnpp-12.3.3.65 | 130.6 MB | #########5 | 95%  2025-05-07T20:25:34.6012810Z 2025-05-07T20:25:34.6012815Z 2025-05-07T20:25:34.6012818Z 2025-05-07T20:25:34.6012822Z 2025-05-07T20:25:34.6012825Z 2025-05-07T20:25:34.6012829Z 2025-05-07T20:25:34.6012833Z 2025-05-07T20:25:34.6642108Z cuda-nvvp-12.8.57 | 112.4 MB | ######1 | 61%  2025-05-07T20:25:34.6705351Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 86% 2025-05-07T20:25:34.6705654Z 2025-05-07T20:25:34.6705698Z 2025-05-07T20:25:34.6705704Z 2025-05-07T20:25:34.6705711Z 2025-05-07T20:25:34.6705717Z 2025-05-07T20:25:34.6708079Z 2025-05-07T20:25:34.6978807Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 77%  2025-05-07T20:25:34.6979134Z 2025-05-07T20:25:34.6979138Z 2025-05-07T20:25:34.6979142Z 2025-05-07T20:25:34.6979146Z 2025-05-07T20:25:34.6981882Z 2025-05-07T20:25:34.7020548Z libnpp-12.3.3.65 | 130.6 MB | #########7 | 97%  2025-05-07T20:25:34.7020896Z 2025-05-07T20:25:34.7020903Z 2025-05-07T20:25:34.7020908Z 2025-05-07T20:25:34.7020913Z 2025-05-07T20:25:34.7020918Z 2025-05-07T20:25:34.7020923Z 2025-05-07T20:25:34.7020928Z 2025-05-07T20:25:34.7692126Z cuda-nvvp-12.8.57 | 112.4 MB | ######3 | 63%  2025-05-07T20:25:34.7710773Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 2025-05-07T20:25:34.7711145Z 2025-05-07T20:25:34.7711152Z 2025-05-07T20:25:34.7711157Z 2025-05-07T20:25:34.7711162Z 2025-05-07T20:25:34.7711201Z 2025-05-07T20:25:34.7711206Z 2025-05-07T20:25:34.7978949Z cuda-nsight-12.8.55 | 113.2 MB | #######9 | 80%  2025-05-07T20:25:34.7979250Z 2025-05-07T20:25:34.7979254Z 2025-05-07T20:25:34.7979257Z 2025-05-07T20:25:34.7979507Z 2025-05-07T20:25:34.7980599Z 2025-05-07T20:25:34.8030779Z libnpp-12.3.3.65 | 130.6 MB | #########9 | 99%  2025-05-07T20:25:34.8031064Z 2025-05-07T20:25:34.8031070Z 2025-05-07T20:25:34.8031074Z 2025-05-07T20:25:34.8031077Z 2025-05-07T20:25:34.8031081Z 2025-05-07T20:25:34.8031085Z 2025-05-07T20:25:34.8031088Z 2025-05-07T20:25:34.8694820Z cuda-nvvp-12.8.57 | 112.4 MB | ######5 | 66%  2025-05-07T20:25:34.8713608Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 87% 2025-05-07T20:25:34.8713970Z 2025-05-07T20:25:34.8713975Z 2025-05-07T20:25:34.8713978Z 2025-05-07T20:25:34.8713985Z 2025-05-07T20:25:34.8713990Z 
2025-05-07T20:25:34.8713996Z 2025-05-07T20:25:34.9035076Z cuda-nsight-12.8.55 | 113.2 MB | ########2 | 82%  2025-05-07T20:25:34.9035662Z 2025-05-07T20:25:34.9035668Z 2025-05-07T20:25:34.9035673Z 2025-05-07T20:25:34.9035677Z 2025-05-07T20:25:34.9035680Z 2025-05-07T20:25:34.9035691Z 2025-05-07T20:25:34.9035705Z 2025-05-07T20:25:34.9715054Z cuda-nvvp-12.8.57 | 112.4 MB | ######7 | 68%  2025-05-07T20:25:34.9715497Z 2025-05-07T20:25:34.9715513Z 2025-05-07T20:25:34.9715518Z 2025-05-07T20:25:34.9715524Z 2025-05-07T20:25:34.9715529Z 2025-05-07T20:25:34.9715534Z 2025-05-07T20:25:34.9725267Z cuda-nsight-12.8.55 | 113.2 MB | ########4 | 85%  2025-05-07T20:25:35.0044552Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:25:35.0044824Z 2025-05-07T20:25:35.0045196Z 2025-05-07T20:25:35.0045213Z 2025-05-07T20:25:35.0045219Z 2025-05-07T20:25:35.0045225Z 2025-05-07T20:25:35.0045233Z 2025-05-07T20:25:35.0045241Z 2025-05-07T20:25:35.0717322Z cuda-nvvp-12.8.57 | 112.4 MB | ####### | 70%  2025-05-07T20:25:35.0717741Z 2025-05-07T20:25:35.0717747Z 2025-05-07T20:25:35.0717752Z 2025-05-07T20:25:35.0717757Z 2025-05-07T20:25:35.0717762Z 2025-05-07T20:25:35.0717767Z 2025-05-07T20:25:35.0725816Z cuda-nsight-12.8.55 | 113.2 MB | ########7 | 88%  2025-05-07T20:25:35.1046175Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 88% 2025-05-07T20:25:35.1046581Z 2025-05-07T20:25:35.1046587Z 2025-05-07T20:25:35.1046593Z 2025-05-07T20:25:35.1046612Z 2025-05-07T20:25:35.1046618Z 2025-05-07T20:25:35.1046623Z 2025-05-07T20:25:35.1046628Z 2025-05-07T20:25:35.1717257Z cuda-nvvp-12.8.57 | 112.4 MB | #######2 | 73%  2025-05-07T20:25:35.1717698Z 2025-05-07T20:25:35.1717711Z 2025-05-07T20:25:35.1717715Z 2025-05-07T20:25:35.1717719Z 2025-05-07T20:25:35.1717726Z 2025-05-07T20:25:35.1717730Z 2025-05-07T20:25:35.1739881Z cuda-nsight-12.8.55 | 113.2 MB | ######### | 90%  2025-05-07T20:25:35.2049336Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:25:35.2049758Z 2025-05-07T20:25:35.2049765Z 2025-05-07T20:25:35.2049771Z 2025-05-07T20:25:35.2049777Z 2025-05-07T20:25:35.2049782Z 2025-05-07T20:25:35.2049789Z 2025-05-07T20:25:35.2049809Z 2025-05-07T20:25:35.2720310Z cuda-nvvp-12.8.57 | 112.4 MB | #######5 | 75%  2025-05-07T20:25:35.2720690Z 2025-05-07T20:25:35.2720694Z 2025-05-07T20:25:35.2720698Z 2025-05-07T20:25:35.2720701Z 2025-05-07T20:25:35.2720705Z 2025-05-07T20:25:35.2720709Z 2025-05-07T20:25:35.2809302Z cuda-nsight-12.8.55 | 113.2 MB | #########2 | 93%  2025-05-07T20:25:35.3086989Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:25:35.3087390Z 2025-05-07T20:25:35.3087397Z 2025-05-07T20:25:35.3087403Z 2025-05-07T20:25:35.3087409Z 2025-05-07T20:25:35.3087415Z 2025-05-07T20:25:35.3087421Z 2025-05-07T20:25:35.3088807Z 2025-05-07T20:25:35.3732359Z cuda-nvvp-12.8.57 | 112.4 MB | #######7 | 77%  2025-05-07T20:25:35.3732845Z 2025-05-07T20:25:35.3732852Z 2025-05-07T20:25:35.3732857Z 2025-05-07T20:25:35.3732862Z 2025-05-07T20:25:35.3732867Z 2025-05-07T20:25:35.3732873Z 2025-05-07T20:25:35.3813006Z cuda-nsight-12.8.55 | 113.2 MB | #########5 | 95%  2025-05-07T20:25:35.4093136Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:25:35.4093415Z 2025-05-07T20:25:35.4093419Z 2025-05-07T20:25:35.4093423Z 2025-05-07T20:25:35.4093426Z 2025-05-07T20:25:35.4093430Z 2025-05-07T20:25:35.4093434Z 2025-05-07T20:25:35.4096883Z 2025-05-07T20:25:35.4817981Z cuda-nvvp-12.8.57 | 112.4 MB | #######9 | 80%  2025-05-07T20:25:35.4927853Z libcublas-12.8.3.14 | 460.2 MB | ######### | 91% 2025-05-07T20:25:35.4928117Z 
2025-05-07T20:25:35.4928121Z 2025-05-07T20:25:35.4928125Z 2025-05-07T20:25:35.4928129Z 2025-05-07T20:25:35.4928133Z 2025-05-07T20:25:35.4928137Z 2025-05-07T20:25:35.5205426Z cuda-nsight-12.8.55 | 113.2 MB | #########7 | 98%  2025-05-07T20:25:35.5205992Z 2025-05-07T20:25:35.5205997Z 2025-05-07T20:25:35.5206000Z 2025-05-07T20:25:35.5206004Z 2025-05-07T20:25:35.5206008Z 2025-05-07T20:25:35.5206012Z 2025-05-07T20:25:35.5206453Z 2025-05-07T20:25:35.5871081Z cuda-nvvp-12.8.57 | 112.4 MB | ########2 | 82%  2025-05-07T20:25:35.6205699Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 91% 2025-05-07T20:25:35.6206029Z 2025-05-07T20:25:35.6206035Z 2025-05-07T20:25:35.6206039Z 2025-05-07T20:25:35.6206045Z 2025-05-07T20:25:35.6206050Z 2025-05-07T20:25:35.6206056Z 2025-05-07T20:25:35.6206061Z 2025-05-07T20:25:35.6871831Z cuda-nvvp-12.8.57 | 112.4 MB | ########4 | 84%  2025-05-07T20:25:35.7213852Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:25:35.7214125Z 2025-05-07T20:25:35.7214138Z 2025-05-07T20:25:35.7214142Z 2025-05-07T20:25:35.7214146Z 2025-05-07T20:25:35.7214150Z 2025-05-07T20:25:35.7214153Z 2025-05-07T20:25:35.7214990Z 2025-05-07T20:25:35.7878979Z cuda-nvvp-12.8.57 | 112.4 MB | ########6 | 87%  2025-05-07T20:25:35.8218085Z libcublas-12.8.3.14 | 460.2 MB | #########2 | 93% 2025-05-07T20:25:35.8218353Z 2025-05-07T20:25:35.8218357Z 2025-05-07T20:25:35.8218400Z 2025-05-07T20:25:35.8218406Z 2025-05-07T20:25:35.8218421Z 2025-05-07T20:25:35.8218427Z 2025-05-07T20:25:35.8218431Z 2025-05-07T20:25:35.8879486Z cuda-nvvp-12.8.57 | 112.4 MB | ########9 | 90%  2025-05-07T20:25:35.9221282Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 93% 2025-05-07T20:25:35.9221560Z 2025-05-07T20:25:35.9221564Z 2025-05-07T20:25:35.9221568Z 2025-05-07T20:25:35.9221572Z 2025-05-07T20:25:35.9221575Z 2025-05-07T20:25:35.9221580Z 2025-05-07T20:25:35.9221584Z 2025-05-07T20:25:35.9879800Z cuda-nvvp-12.8.57 | 112.4 MB | #########2 | 93%  2025-05-07T20:25:36.0223893Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 94% 2025-05-07T20:25:36.0224236Z 2025-05-07T20:25:36.0224278Z 2025-05-07T20:25:36.0224284Z 2025-05-07T20:25:36.0224289Z 2025-05-07T20:25:36.0224294Z 2025-05-07T20:25:36.0224314Z 2025-05-07T20:25:36.0224750Z 2025-05-07T20:25:36.1055487Z cuda-nvvp-12.8.57 | 112.4 MB | #########5 | 95%  2025-05-07T20:25:36.1245399Z libcublas-12.8.3.14 | 460.2 MB | #########4 | 94% 2025-05-07T20:25:36.1245734Z 2025-05-07T20:25:36.1245739Z 2025-05-07T20:25:36.1245747Z 2025-05-07T20:25:36.1245752Z 2025-05-07T20:25:36.1245757Z 2025-05-07T20:25:36.1245763Z 2025-05-07T20:25:36.1245779Z 2025-05-07T20:25:36.2101633Z cuda-nvvp-12.8.57 | 112.4 MB | #########8 | 98%  2025-05-07T20:25:36.3102517Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 95% 2025-05-07T20:25:36.4107167Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 96% 2025-05-07T20:25:36.5106883Z libcublas-12.8.3.14 | 460.2 MB | #########6 | 96% 2025-05-07T20:25:36.6589376Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 97% 2025-05-07T20:25:36.7594829Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 98% 2025-05-07T20:25:36.8596627Z libcublas-12.8.3.14 | 460.2 MB | #########8 | 98% 2025-05-07T20:25:38.5124394Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 99% 2025-05-07T20:25:38.5124974Z 2025-05-07T20:25:38.5124995Z 2025-05-07T20:25:38.5125000Z 2025-05-07T20:25:38.5125005Z 2025-05-07T20:25:39.0297125Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:25:39.0297464Z 2025-05-07T20:25:39.0297468Z 2025-05-07T20:25:39.0297472Z 2025-05-07T20:25:39.0297476Z 
2025-05-07T20:25:39.0299762Z 2025-05-07T20:25:39.0769233Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:25:39.0769523Z 2025-05-07T20:25:39.0769527Z 2025-05-07T20:25:39.0769534Z 2025-05-07T20:25:39.0769538Z 2025-05-07T20:25:39.0769542Z 2025-05-07T20:25:39.0769546Z 2025-05-07T20:25:39.0769550Z 2025-05-07T20:25:39.0776407Z 2025-05-07T20:25:39.1773531Z cuda-nvrtc-12.8.61 | 63.1 MB | | 0%  2025-05-07T20:25:39.1774250Z 2025-05-07T20:25:39.1774256Z 2025-05-07T20:25:39.1774262Z 2025-05-07T20:25:39.1774267Z 2025-05-07T20:25:39.1774272Z 2025-05-07T20:25:39.1774288Z 2025-05-07T20:25:39.1774294Z 2025-05-07T20:25:39.1774315Z 2025-05-07T20:25:39.2774070Z cuda-nvrtc-12.8.61 | 63.1 MB | 5 | 6%  2025-05-07T20:25:39.2774378Z 2025-05-07T20:25:39.2774382Z 2025-05-07T20:25:39.2774393Z 2025-05-07T20:25:39.2774397Z 2025-05-07T20:25:39.2774401Z 2025-05-07T20:25:39.2774408Z 2025-05-07T20:25:39.2774413Z 2025-05-07T20:25:39.2774554Z 2025-05-07T20:25:39.3876596Z cuda-nvrtc-12.8.61 | 63.1 MB | #1 | 12%  2025-05-07T20:25:39.3876919Z 2025-05-07T20:25:39.3876923Z 2025-05-07T20:25:39.3876930Z 2025-05-07T20:25:39.3876935Z 2025-05-07T20:25:39.3876940Z 2025-05-07T20:25:39.3876945Z 2025-05-07T20:25:39.3876950Z 2025-05-07T20:25:39.3877252Z 2025-05-07T20:25:39.3913406Z cuda-nvrtc-12.8.61 | 63.1 MB | #7 | 18%  2025-05-07T20:25:39.3913745Z 2025-05-07T20:25:39.3913749Z 2025-05-07T20:25:39.3913753Z 2025-05-07T20:25:39.3913757Z 2025-05-07T20:25:39.3913760Z 2025-05-07T20:25:39.3913764Z 2025-05-07T20:25:39.4422795Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:25:39.4423258Z 2025-05-07T20:25:39.4423265Z 2025-05-07T20:25:39.4423271Z 2025-05-07T20:25:39.4423278Z 2025-05-07T20:25:39.4423284Z 2025-05-07T20:25:39.4423291Z 2025-05-07T20:25:39.4423296Z 2025-05-07T20:25:39.4423304Z 2025-05-07T20:25:39.4423514Z 2025-05-07T20:25:39.4921809Z libcurand-10.3.9.55 | 43.6 MB | | 0%  2025-05-07T20:25:39.4922192Z 2025-05-07T20:25:39.4922198Z 2025-05-07T20:25:39.4922204Z 2025-05-07T20:25:39.4922209Z 2025-05-07T20:25:39.4922215Z 2025-05-07T20:25:39.4922219Z 2025-05-07T20:25:39.4922222Z 2025-05-07T20:25:39.4923673Z 2025-05-07T20:25:39.5423872Z cuda-nvrtc-12.8.61 | 63.1 MB | ##3 | 24%  2025-05-07T20:25:39.5424215Z 2025-05-07T20:25:39.5424219Z 2025-05-07T20:25:39.5424223Z 2025-05-07T20:25:39.5424228Z 2025-05-07T20:25:39.5424231Z 2025-05-07T20:25:39.5424235Z 2025-05-07T20:25:39.5424239Z 2025-05-07T20:25:39.5424254Z 2025-05-07T20:25:39.5424258Z 2025-05-07T20:25:39.6074309Z libcurand-10.3.9.55 | 43.6 MB | 6 | 7%  2025-05-07T20:25:39.6074772Z 2025-05-07T20:25:39.6074778Z 2025-05-07T20:25:39.6074783Z 2025-05-07T20:25:39.6074788Z 2025-05-07T20:25:39.6074793Z 2025-05-07T20:25:39.6074799Z 2025-05-07T20:25:39.6074806Z 2025-05-07T20:25:39.6080422Z 2025-05-07T20:25:39.6436059Z cuda-nvrtc-12.8.61 | 63.1 MB | ##9 | 29%  2025-05-07T20:25:39.6436378Z 2025-05-07T20:25:39.6436382Z 2025-05-07T20:25:39.6436386Z 2025-05-07T20:25:39.6436390Z 2025-05-07T20:25:39.6436393Z 2025-05-07T20:25:39.6436397Z 2025-05-07T20:25:39.6436400Z 2025-05-07T20:25:39.6436404Z 2025-05-07T20:25:39.6436417Z 2025-05-07T20:25:39.7102191Z libcurand-10.3.9.55 | 43.6 MB | #3 | 14%  2025-05-07T20:25:39.7102683Z 2025-05-07T20:25:39.7102690Z 2025-05-07T20:25:39.7102696Z 2025-05-07T20:25:39.7102710Z 2025-05-07T20:25:39.7102716Z 2025-05-07T20:25:39.7103022Z 2025-05-07T20:25:39.7103029Z 2025-05-07T20:25:39.7103293Z 2025-05-07T20:25:39.7443635Z cuda-nvrtc-12.8.61 | 63.1 MB | ###4 | 35%  2025-05-07T20:25:39.7444037Z 2025-05-07T20:25:39.7444042Z 2025-05-07T20:25:39.7444047Z 
2025-05-07T20:25:39.7444053Z 2025-05-07T20:25:39.7444058Z 2025-05-07T20:25:39.7444063Z 2025-05-07T20:25:39.7444068Z 2025-05-07T20:25:39.7444074Z 2025-05-07T20:25:39.7445729Z 2025-05-07T20:25:39.8232538Z libcurand-10.3.9.55 | 43.6 MB | ## | 21%  2025-05-07T20:25:39.8232992Z 2025-05-07T20:25:39.8232996Z 2025-05-07T20:25:39.8233000Z 2025-05-07T20:25:39.8233004Z 2025-05-07T20:25:39.8233007Z 2025-05-07T20:25:39.8233011Z 2025-05-07T20:25:39.8233015Z 2025-05-07T20:25:39.8233918Z 2025-05-07T20:25:39.8455226Z cuda-nvrtc-12.8.61 | 63.1 MB | ###9 | 40%  2025-05-07T20:25:39.8455589Z 2025-05-07T20:25:39.8455593Z 2025-05-07T20:25:39.8455597Z 2025-05-07T20:25:39.8455615Z 2025-05-07T20:25:39.8455619Z 2025-05-07T20:25:39.8455623Z 2025-05-07T20:25:39.8455626Z 2025-05-07T20:25:39.8455630Z 2025-05-07T20:25:39.8458360Z 2025-05-07T20:25:39.9233314Z libcurand-10.3.9.55 | 43.6 MB | ##7 | 28%  2025-05-07T20:25:39.9233657Z 2025-05-07T20:25:39.9233661Z 2025-05-07T20:25:39.9233665Z 2025-05-07T20:25:39.9233668Z 2025-05-07T20:25:39.9233672Z 2025-05-07T20:25:39.9233676Z 2025-05-07T20:25:39.9233680Z 2025-05-07T20:25:39.9234480Z 2025-05-07T20:25:39.9459402Z cuda-nvrtc-12.8.61 | 63.1 MB | ####5 | 45%  2025-05-07T20:25:39.9459722Z 2025-05-07T20:25:39.9459728Z 2025-05-07T20:25:39.9459731Z 2025-05-07T20:25:39.9459735Z 2025-05-07T20:25:39.9459739Z 2025-05-07T20:25:39.9459764Z 2025-05-07T20:25:39.9459768Z 2025-05-07T20:25:39.9459771Z 2025-05-07T20:25:39.9461168Z 2025-05-07T20:25:40.0357013Z libcurand-10.3.9.55 | 43.6 MB | ###5 | 35%  2025-05-07T20:25:40.0357373Z 2025-05-07T20:25:40.0357378Z 2025-05-07T20:25:40.0357382Z 2025-05-07T20:25:40.0357386Z 2025-05-07T20:25:40.0357389Z 2025-05-07T20:25:40.0357393Z 2025-05-07T20:25:40.0357397Z 2025-05-07T20:25:40.0357400Z 2025-05-07T20:25:40.0545753Z cuda-nvrtc-12.8.61 | 63.1 MB | ##### | 50%  2025-05-07T20:25:40.0546061Z 2025-05-07T20:25:40.0546065Z 2025-05-07T20:25:40.0546069Z 2025-05-07T20:25:40.0546073Z 2025-05-07T20:25:40.0546076Z 2025-05-07T20:25:40.0546080Z 2025-05-07T20:25:40.0546083Z 2025-05-07T20:25:40.0546095Z 2025-05-07T20:25:40.0556513Z 2025-05-07T20:25:40.0838550Z libcurand-10.3.9.55 | 43.6 MB | ####2 | 43%  2025-05-07T20:25:40.0838972Z 2025-05-07T20:25:40.0838978Z 2025-05-07T20:25:40.0839008Z 2025-05-07T20:25:40.0839014Z 2025-05-07T20:25:40.0839018Z 2025-05-07T20:25:40.0839022Z 2025-05-07T20:25:40.0839025Z 2025-05-07T20:25:40.1359098Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:25:40.1359449Z 2025-05-07T20:25:40.1359453Z 2025-05-07T20:25:40.1359457Z 2025-05-07T20:25:40.1359461Z 2025-05-07T20:25:40.1359465Z 2025-05-07T20:25:40.1359477Z 2025-05-07T20:25:40.1359481Z 2025-05-07T20:25:40.1359485Z 2025-05-07T20:25:40.1484402Z cuda-nvrtc-12.8.61 | 63.1 MB | #####5 | 56%  2025-05-07T20:25:40.1484707Z 2025-05-07T20:25:40.1484711Z 2025-05-07T20:25:40.1484723Z 2025-05-07T20:25:40.1484727Z 2025-05-07T20:25:40.1484730Z 2025-05-07T20:25:40.1484734Z 2025-05-07T20:25:40.1484737Z 2025-05-07T20:25:40.1484741Z 2025-05-07T20:25:40.1484744Z 2025-05-07T20:25:40.1484748Z 2025-05-07T20:25:40.1576401Z gds-tools-1.13.0.11 | 37.9 MB | | 0%  2025-05-07T20:25:40.1576738Z 2025-05-07T20:25:40.1576742Z 2025-05-07T20:25:40.1576746Z 2025-05-07T20:25:40.1576749Z 2025-05-07T20:25:40.1576753Z 2025-05-07T20:25:40.1576757Z 2025-05-07T20:25:40.1576760Z 2025-05-07T20:25:40.1576764Z 2025-05-07T20:25:40.1579226Z 2025-05-07T20:25:40.2492281Z libcurand-10.3.9.55 | 43.6 MB | ####9 | 49%  2025-05-07T20:25:40.2492620Z 2025-05-07T20:25:40.2492624Z 2025-05-07T20:25:40.2492628Z 
2025-05-07T20:25:40.2492631Z 2025-05-07T20:25:40.2492635Z 2025-05-07T20:25:40.2492639Z 2025-05-07T20:25:40.2492642Z 2025-05-07T20:25:40.2492647Z 2025-05-07T20:25:40.2492650Z 2025-05-07T20:25:40.2492879Z 2025-05-07T20:25:40.2586451Z gds-tools-1.13.0.11 | 37.9 MB | 7 | 8%  2025-05-07T20:25:40.2586851Z 2025-05-07T20:25:40.2586857Z 2025-05-07T20:25:40.2586862Z 2025-05-07T20:25:40.2586867Z 2025-05-07T20:25:40.2586872Z 2025-05-07T20:25:40.2586877Z 2025-05-07T20:25:40.2586885Z 2025-05-07T20:25:40.2588426Z 2025-05-07T20:25:40.2652375Z cuda-nvrtc-12.8.61 | 63.1 MB | ###### | 61%  2025-05-07T20:25:40.2652676Z 2025-05-07T20:25:40.2652680Z 2025-05-07T20:25:40.2652684Z 2025-05-07T20:25:40.2652687Z 2025-05-07T20:25:40.2652698Z 2025-05-07T20:25:40.2652716Z 2025-05-07T20:25:40.2652722Z 2025-05-07T20:25:40.2652728Z 2025-05-07T20:25:40.2656703Z 2025-05-07T20:25:40.3494528Z libcurand-10.3.9.55 | 43.6 MB | #####6 | 56%  2025-05-07T20:25:40.3494870Z 2025-05-07T20:25:40.3494875Z 2025-05-07T20:25:40.3494880Z 2025-05-07T20:25:40.3494885Z 2025-05-07T20:25:40.3494905Z 2025-05-07T20:25:40.3494910Z 2025-05-07T20:25:40.3494915Z 2025-05-07T20:25:40.3494921Z 2025-05-07T20:25:40.3494927Z 2025-05-07T20:25:40.3498145Z 2025-05-07T20:25:40.3646142Z gds-tools-1.13.0.11 | 37.9 MB | #5 | 15%  2025-05-07T20:25:40.3646490Z 2025-05-07T20:25:40.3646495Z 2025-05-07T20:25:40.3646508Z 2025-05-07T20:25:40.3646512Z 2025-05-07T20:25:40.3646545Z 2025-05-07T20:25:40.3646549Z 2025-05-07T20:25:40.3646554Z 2025-05-07T20:25:40.3646563Z 2025-05-07T20:25:40.3654790Z cuda-nvrtc-12.8.61 | 63.1 MB | ######5 | 65%  2025-05-07T20:25:40.3655101Z 2025-05-07T20:25:40.3655122Z 2025-05-07T20:25:40.3655126Z 2025-05-07T20:25:40.3655130Z 2025-05-07T20:25:40.3655134Z 2025-05-07T20:25:40.3655137Z 2025-05-07T20:25:40.3655141Z 2025-05-07T20:25:40.3655145Z 2025-05-07T20:25:40.3657500Z 2025-05-07T20:25:40.4496518Z libcurand-10.3.9.55 | 43.6 MB | ######3 | 64%  2025-05-07T20:25:40.4496923Z 2025-05-07T20:25:40.4496927Z 2025-05-07T20:25:40.4496938Z 2025-05-07T20:25:40.4496942Z 2025-05-07T20:25:40.4496947Z 2025-05-07T20:25:40.4496950Z 2025-05-07T20:25:40.4496954Z 2025-05-07T20:25:40.4496958Z 2025-05-07T20:25:40.4496961Z 2025-05-07T20:25:40.4496966Z 2025-05-07T20:25:40.4711449Z gds-tools-1.13.0.11 | 37.9 MB | ##3 | 23%  2025-05-07T20:25:40.4711827Z 2025-05-07T20:25:40.4711833Z 2025-05-07T20:25:40.4711838Z 2025-05-07T20:25:40.4711843Z 2025-05-07T20:25:40.4711848Z 2025-05-07T20:25:40.4711853Z 2025-05-07T20:25:40.4711858Z 2025-05-07T20:25:40.4711863Z 2025-05-07T20:25:40.4846803Z cuda-nvrtc-12.8.61 | 63.1 MB | ####### | 70%  2025-05-07T20:25:40.4847203Z 2025-05-07T20:25:40.4847210Z 2025-05-07T20:25:40.4847215Z 2025-05-07T20:25:40.4847220Z 2025-05-07T20:25:40.4847225Z 2025-05-07T20:25:40.4847230Z 2025-05-07T20:25:40.4847235Z 2025-05-07T20:25:40.4847241Z 2025-05-07T20:25:40.4847246Z 2025-05-07T20:25:40.5640687Z libcurand-10.3.9.55 | 43.6 MB | ####### | 71%  2025-05-07T20:25:40.5641072Z 2025-05-07T20:25:40.5641078Z 2025-05-07T20:25:40.5641083Z 2025-05-07T20:25:40.5641088Z 2025-05-07T20:25:40.5641093Z 2025-05-07T20:25:40.5641098Z 2025-05-07T20:25:40.5641104Z 2025-05-07T20:25:40.5641110Z 2025-05-07T20:25:40.5641115Z 2025-05-07T20:25:40.5653171Z 2025-05-07T20:25:40.5732124Z gds-tools-1.13.0.11 | 37.9 MB | ### | 31%  2025-05-07T20:25:40.5732479Z 2025-05-07T20:25:40.5732483Z 2025-05-07T20:25:40.5732487Z 2025-05-07T20:25:40.5732491Z 2025-05-07T20:25:40.5732759Z 2025-05-07T20:25:40.5732765Z 2025-05-07T20:25:40.5732776Z 2025-05-07T20:25:40.5732780Z 2025-05-07T20:25:40.5849046Z 
cuda-nvrtc-12.8.61 | 63.1 MB | #######4 | 75%  2025-05-07T20:25:40.5849364Z 2025-05-07T20:25:40.5849371Z 2025-05-07T20:25:40.5849376Z 2025-05-07T20:25:40.5849390Z 2025-05-07T20:25:40.5849396Z 2025-05-07T20:25:40.5849402Z 2025-05-07T20:25:40.5849408Z 2025-05-07T20:25:40.5849413Z 2025-05-07T20:25:40.5849419Z 2025-05-07T20:25:40.6738706Z libcurand-10.3.9.55 | 43.6 MB | #######7 | 78%  2025-05-07T20:25:40.6739040Z 2025-05-07T20:25:40.6739047Z 2025-05-07T20:25:40.6739053Z 2025-05-07T20:25:40.6739058Z 2025-05-07T20:25:40.6739064Z 2025-05-07T20:25:40.6739069Z 2025-05-07T20:25:40.6739425Z 2025-05-07T20:25:40.6739430Z 2025-05-07T20:25:40.6777167Z cuda-nvrtc-12.8.61 | 63.1 MB | #######9 | 79%  2025-05-07T20:25:40.6777626Z 2025-05-07T20:25:40.6777633Z 2025-05-07T20:25:40.6777638Z 2025-05-07T20:25:40.6777666Z 2025-05-07T20:25:40.6777673Z 2025-05-07T20:25:40.6777678Z 2025-05-07T20:25:40.6777684Z 2025-05-07T20:25:40.6777689Z 2025-05-07T20:25:40.6777695Z 2025-05-07T20:25:40.6778728Z 2025-05-07T20:25:40.6856750Z gds-tools-1.13.0.11 | 37.9 MB | ###8 | 38%  2025-05-07T20:25:40.6857106Z 2025-05-07T20:25:40.6857110Z 2025-05-07T20:25:40.6857113Z 2025-05-07T20:25:40.6857117Z 2025-05-07T20:25:40.6857121Z 2025-05-07T20:25:40.6857124Z 2025-05-07T20:25:40.6857128Z 2025-05-07T20:25:40.6857139Z 2025-05-07T20:25:40.6857147Z 2025-05-07T20:25:40.7779487Z libcurand-10.3.9.55 | 43.6 MB | ########4 | 84%  2025-05-07T20:25:40.7779928Z 2025-05-07T20:25:40.7779932Z 2025-05-07T20:25:40.7779946Z 2025-05-07T20:25:40.7779985Z 2025-05-07T20:25:40.7779989Z 2025-05-07T20:25:40.7779992Z 2025-05-07T20:25:40.7779996Z 2025-05-07T20:25:40.7780000Z 2025-05-07T20:25:40.7780003Z 2025-05-07T20:25:40.7781270Z 2025-05-07T20:25:40.7830233Z gds-tools-1.13.0.11 | 37.9 MB | ####5 | 46%  2025-05-07T20:25:40.7830546Z 2025-05-07T20:25:40.7830550Z 2025-05-07T20:25:40.7830554Z 2025-05-07T20:25:40.7830565Z 2025-05-07T20:25:40.7830569Z 2025-05-07T20:25:40.7830572Z 2025-05-07T20:25:40.7830576Z 2025-05-07T20:25:40.7830579Z 2025-05-07T20:25:40.8004478Z cuda-nvrtc-12.8.61 | 63.1 MB | ########4 | 84%  2025-05-07T20:25:40.8004791Z 2025-05-07T20:25:40.8004795Z 2025-05-07T20:25:40.8004799Z 2025-05-07T20:25:40.8004802Z 2025-05-07T20:25:40.8004806Z 2025-05-07T20:25:40.8004810Z 2025-05-07T20:25:40.8004813Z 2025-05-07T20:25:40.8004817Z 2025-05-07T20:25:40.8005040Z 2025-05-07T20:25:40.8780962Z libcurand-10.3.9.55 | 43.6 MB | #########1 | 91%  2025-05-07T20:25:40.8781438Z 2025-05-07T20:25:40.8781442Z 2025-05-07T20:25:40.8781446Z 2025-05-07T20:25:40.8781450Z 2025-05-07T20:25:40.8781453Z 2025-05-07T20:25:40.8781457Z 2025-05-07T20:25:40.8781461Z 2025-05-07T20:25:40.8781477Z 2025-05-07T20:25:40.8781482Z 2025-05-07T20:25:40.8781851Z 2025-05-07T20:25:40.8856594Z gds-tools-1.13.0.11 | 37.9 MB | #####3 | 53%  2025-05-07T20:25:40.8856906Z 2025-05-07T20:25:40.8856910Z 2025-05-07T20:25:40.8856914Z 2025-05-07T20:25:40.8856918Z 2025-05-07T20:25:40.8856921Z 2025-05-07T20:25:40.8856925Z 2025-05-07T20:25:40.8856929Z 2025-05-07T20:25:40.8858133Z 2025-05-07T20:25:40.9037617Z cuda-nvrtc-12.8.61 | 63.1 MB | ########8 | 89%  2025-05-07T20:25:40.9037912Z 2025-05-07T20:25:40.9038186Z 2025-05-07T20:25:40.9038201Z 2025-05-07T20:25:40.9038212Z 2025-05-07T20:25:40.9038221Z 2025-05-07T20:25:40.9038231Z 2025-05-07T20:25:40.9038241Z 2025-05-07T20:25:40.9038286Z 2025-05-07T20:25:40.9040289Z 2025-05-07T20:25:40.9783570Z libcurand-10.3.9.55 | 43.6 MB | #########7 | 98%  2025-05-07T20:25:40.9783914Z 2025-05-07T20:25:40.9783920Z 2025-05-07T20:25:40.9783925Z 2025-05-07T20:25:40.9784232Z 
2025-05-07T20:25:40.9784242Z 2025-05-07T20:25:40.9784247Z 2025-05-07T20:25:40.9784255Z 2025-05-07T20:25:40.9784261Z 2025-05-07T20:25:40.9784267Z 2025-05-07T20:25:40.9789653Z 2025-05-07T20:25:41.0165941Z gds-tools-1.13.0.11 | 37.9 MB | ###### | 61%  2025-05-07T20:25:41.0166259Z 2025-05-07T20:25:41.0166264Z 2025-05-07T20:25:41.0166269Z 2025-05-07T20:25:41.0166274Z 2025-05-07T20:25:41.0166286Z 2025-05-07T20:25:41.0166291Z 2025-05-07T20:25:41.0166296Z 2025-05-07T20:25:41.0166735Z 2025-05-07T20:25:41.0817130Z cuda-nvrtc-12.8.61 | 63.1 MB | #########3 | 93%  2025-05-07T20:25:41.0817537Z 2025-05-07T20:25:41.0817542Z 2025-05-07T20:25:41.0817546Z 2025-05-07T20:25:41.0817551Z 2025-05-07T20:25:41.0817838Z 2025-05-07T20:25:41.0817842Z 2025-05-07T20:25:41.0817846Z 2025-05-07T20:25:41.0817850Z 2025-05-07T20:25:41.0817853Z 2025-05-07T20:25:41.0817857Z 2025-05-07T20:25:41.1167075Z gds-tools-1.13.0.11 | 37.9 MB | ######7 | 68%  2025-05-07T20:25:41.1167505Z 2025-05-07T20:25:41.1167512Z 2025-05-07T20:25:41.1167517Z 2025-05-07T20:25:41.1167522Z 2025-05-07T20:25:41.1167528Z 2025-05-07T20:25:41.1167532Z 2025-05-07T20:25:41.1167537Z 2025-05-07T20:25:41.1169896Z 2025-05-07T20:25:41.1817200Z cuda-nvrtc-12.8.61 | 63.1 MB | #########8 | 98%  2025-05-07T20:25:41.1817537Z 2025-05-07T20:25:41.1817541Z 2025-05-07T20:25:41.1817544Z 2025-05-07T20:25:41.1817548Z 2025-05-07T20:25:41.1817552Z 2025-05-07T20:25:41.1817556Z 2025-05-07T20:25:41.1817559Z 2025-05-07T20:25:41.1817564Z 2025-05-07T20:25:41.1817568Z 2025-05-07T20:25:41.1817571Z 2025-05-07T20:25:41.2822027Z gds-tools-1.13.0.11 | 37.9 MB | #######6 | 76%  2025-05-07T20:25:41.2822393Z 2025-05-07T20:25:41.2822397Z 2025-05-07T20:25:41.2822401Z 2025-05-07T20:25:41.2822405Z 2025-05-07T20:25:41.2822409Z 2025-05-07T20:25:41.2822412Z 2025-05-07T20:25:41.2822416Z 2025-05-07T20:25:41.2822432Z 2025-05-07T20:25:41.2822448Z 2025-05-07T20:25:41.2822451Z 2025-05-07T20:25:41.3826746Z gds-tools-1.13.0.11 | 37.9 MB | ########6 | 86%  2025-05-07T20:25:41.3827080Z 2025-05-07T20:25:41.3827084Z 2025-05-07T20:25:41.3827096Z 2025-05-07T20:25:41.3827100Z 2025-05-07T20:25:41.3827103Z 2025-05-07T20:25:41.3827107Z 2025-05-07T20:25:41.3827111Z 2025-05-07T20:25:41.3827115Z 2025-05-07T20:25:41.3827118Z 2025-05-07T20:25:41.3827122Z 2025-05-07T20:25:42.3986999Z gds-tools-1.13.0.11 | 37.9 MB | #########4 | 95%  2025-05-07T20:25:42.3987340Z 2025-05-07T20:25:42.3987344Z 2025-05-07T20:25:42.3987348Z 2025-05-07T20:25:42.3987351Z 2025-05-07T20:25:42.3987355Z 2025-05-07T20:25:42.3987385Z 2025-05-07T20:25:42.3987389Z 2025-05-07T20:25:42.3987392Z 2025-05-07T20:25:42.3987396Z 2025-05-07T20:25:42.4602747Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:25:42.4603063Z 2025-05-07T20:25:42.4603082Z 2025-05-07T20:25:42.4603087Z 2025-05-07T20:25:42.4603094Z 2025-05-07T20:25:42.4603098Z 2025-05-07T20:25:42.4603102Z 2025-05-07T20:25:42.4603106Z 2025-05-07T20:25:42.4603110Z 2025-05-07T20:25:42.4603113Z 2025-05-07T20:25:42.4603117Z 2025-05-07T20:25:42.4605482Z 2025-05-07T20:25:42.5607422Z libnvjitlink-12.8.61 | 28.7 MB | | 0%  2025-05-07T20:25:42.5607751Z 2025-05-07T20:25:42.5607756Z 2025-05-07T20:25:42.5607759Z 2025-05-07T20:25:42.5607763Z 2025-05-07T20:25:42.5607767Z 2025-05-07T20:25:42.5607771Z 2025-05-07T20:25:42.5607775Z 2025-05-07T20:25:42.5607779Z 2025-05-07T20:25:42.5607791Z 2025-05-07T20:25:42.5607795Z 2025-05-07T20:25:42.5609614Z 2025-05-07T20:25:42.6626518Z libnvjitlink-12.8.61 | 28.7 MB | #2 | 12%  2025-05-07T20:25:42.6626880Z 2025-05-07T20:25:42.6626885Z 2025-05-07T20:25:42.6626888Z 
2025-05-07T20:25:42.6626892Z 2025-05-07T20:25:42.6626896Z 2025-05-07T20:25:42.6627181Z 2025-05-07T20:25:42.6627186Z 2025-05-07T20:25:42.6627190Z 2025-05-07T20:25:42.6627193Z 2025-05-07T20:25:42.6627197Z 2025-05-07T20:25:42.7330486Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:25:42.7330801Z 2025-05-07T20:25:42.7330805Z 2025-05-07T20:25:42.7330808Z 2025-05-07T20:25:42.7330812Z 2025-05-07T20:25:42.7330815Z 2025-05-07T20:25:42.7330819Z 2025-05-07T20:25:42.7330823Z 2025-05-07T20:25:42.7330826Z 2025-05-07T20:25:42.7330830Z 2025-05-07T20:25:42.7330833Z 2025-05-07T20:25:42.7330837Z 2025-05-07T20:25:42.7331833Z 2025-05-07T20:25:42.7751064Z cuda-nvcc-tools-12.8 | 24.5 MB | | 0%  2025-05-07T20:25:42.7751398Z 2025-05-07T20:25:42.7751671Z 2025-05-07T20:25:42.7751675Z 2025-05-07T20:25:42.7751679Z 2025-05-07T20:25:42.7751682Z 2025-05-07T20:25:42.7751686Z 2025-05-07T20:25:42.7751690Z 2025-05-07T20:25:42.7751693Z 2025-05-07T20:25:42.7751697Z 2025-05-07T20:25:42.7751720Z 2025-05-07T20:25:42.7751724Z 2025-05-07T20:25:42.8333529Z libnvjitlink-12.8.61 | 28.7 MB | ##4 | 25%  2025-05-07T20:25:42.8333859Z 2025-05-07T20:25:42.8333863Z 2025-05-07T20:25:42.8333875Z 2025-05-07T20:25:42.8333879Z 2025-05-07T20:25:42.8333883Z 2025-05-07T20:25:42.8333886Z 2025-05-07T20:25:42.8333890Z 2025-05-07T20:25:42.8333893Z 2025-05-07T20:25:42.8333897Z 2025-05-07T20:25:42.8333900Z 2025-05-07T20:25:42.8333904Z 2025-05-07T20:25:42.8333907Z 2025-05-07T20:25:42.8753986Z cuda-nvcc-tools-12.8 | 24.5 MB | #1 | 12%  2025-05-07T20:25:42.8754377Z 2025-05-07T20:25:42.8754383Z 2025-05-07T20:25:42.8754388Z 2025-05-07T20:25:42.8754395Z 2025-05-07T20:25:42.8754400Z 2025-05-07T20:25:42.8754437Z 2025-05-07T20:25:42.8754442Z 2025-05-07T20:25:42.8754448Z 2025-05-07T20:25:42.8754453Z 2025-05-07T20:25:42.8754459Z 2025-05-07T20:25:42.8754464Z 2025-05-07T20:25:42.9336914Z libnvjitlink-12.8.61 | 28.7 MB | ###3 | 33%  2025-05-07T20:25:42.9337246Z 2025-05-07T20:25:42.9337251Z 2025-05-07T20:25:42.9337255Z 2025-05-07T20:25:42.9337258Z 2025-05-07T20:25:42.9337262Z 2025-05-07T20:25:42.9337266Z 2025-05-07T20:25:42.9337270Z 2025-05-07T20:25:42.9337273Z 2025-05-07T20:25:42.9337285Z 2025-05-07T20:25:42.9337289Z 2025-05-07T20:25:42.9337292Z 2025-05-07T20:25:42.9337296Z 2025-05-07T20:25:42.9755535Z cuda-nvcc-tools-12.8 | 24.5 MB | ##3 | 23%  2025-05-07T20:25:42.9755909Z 2025-05-07T20:25:42.9755914Z 2025-05-07T20:25:42.9755918Z 2025-05-07T20:25:42.9755921Z 2025-05-07T20:25:42.9755925Z 2025-05-07T20:25:42.9755929Z 2025-05-07T20:25:42.9755934Z 2025-05-07T20:25:42.9755938Z 2025-05-07T20:25:42.9755962Z 2025-05-07T20:25:42.9755966Z 2025-05-07T20:25:42.9755970Z 2025-05-07T20:25:43.0582247Z libnvjitlink-12.8.61 | 28.7 MB | ####2 | 42%  2025-05-07T20:25:43.0582582Z 2025-05-07T20:25:43.0582610Z 2025-05-07T20:25:43.0582614Z 2025-05-07T20:25:43.0582617Z 2025-05-07T20:25:43.0582621Z 2025-05-07T20:25:43.0582625Z 2025-05-07T20:25:43.0582628Z 2025-05-07T20:25:43.0582632Z 2025-05-07T20:25:43.0582635Z 2025-05-07T20:25:43.0582639Z 2025-05-07T20:25:43.0582642Z 2025-05-07T20:25:43.0588502Z 2025-05-07T20:25:43.0610498Z cuda-nvcc-tools-12.8 | 24.5 MB | ###5 | 35%  2025-05-07T20:25:43.0610895Z 2025-05-07T20:25:43.0610900Z 2025-05-07T20:25:43.0761337Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:25:43.0770679Z 2025-05-07T20:25:43.0770687Z 2025-05-07T20:25:43.0770692Z 2025-05-07T20:25:43.0770698Z 2025-05-07T20:25:43.0770703Z 2025-05-07T20:25:43.0770708Z 2025-05-07T20:25:43.0770732Z 2025-05-07T20:25:43.0770737Z 2025-05-07T20:25:43.0770742Z 
2025-05-07T20:25:43.1694316Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:25:43.4392132Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:25:44.5016723Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:25:44.5999280Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:25:44.6931083Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:25:44.7582958Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:25:45.1548009Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:25:45.4045091Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:25:45.4991115Z cuda-cupti-dev-12.8. | 4.0 MB | ########## | 100%
2025-05-07T20:25:45.5047900Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:25:45.7988314Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:25:47.6749554Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:25:47.9341275Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:25:48.0383598Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:25:48.0384076Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:25:48.1498515Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:25:48.5161447Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:25:55.3984063Z done
2025-05-07T20:25:55.6022354Z Preparing transaction: done
2025-05-07T20:25:59.5087946Z Verifying transaction: done
2025-05-07T20:26:00.1179946Z Executing transaction: done
2025-05-07T20:26:02.2909621Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:02.2910134Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:02.2911202Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:02.2924583Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
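[NOTE] The two symlinks above exist because recent CUDA 12.x conda packages ship only the versioned library libnvToolsExt.so.1, while anything that links with -lnvToolsExt needs the unversioned name to resolve at link time. A minimal sketch of the same workaround, assuming the environment prefix is available as $CONDA_PREFIX (a stand-in for the hard-coded miniconda paths in this log):

  # Recreate the unversioned NVTX library name so `-lnvToolsExt` resolves.
  # $CONDA_PREFIX is an assumption; this job uses absolute env paths instead.
  for libdir in "$CONDA_PREFIX/lib" "$CONDA_PREFIX/targets/x86_64-linux/lib"; do
    if [ -e "$libdir/libnvToolsExt.so.1" ]; then
      ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
    fi
  done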
2025-05-07T20:26:02.2937441Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:02.2942821Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:02.4592840Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:02.4617227Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:02.4995067Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:04.4009299Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:04.4652300Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:04.8861957Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:04.9217820Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
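[NOTE] The ERROR above appears to be expected: `conda run ... printenv LD_LIBRARY_PATH` exits non-zero when the variable is not yet set on the env, after which the script persists a value with `conda env config vars set`, so every later `conda run -n build_binary` activation sees it. A minimal sketch of this check-then-set pattern, assuming the same env name (the stubs path is copied from the log):

  # Persist LD_LIBRARY_PATH on the env itself; printenv failing here only
  # means the variable has not been set yet, not that something is broken.
  STUBS=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
  if ! conda run -n build_binary printenv LD_LIBRARY_PATH >/dev/null 2>&1; then
    conda env config vars set -n build_binary LD_LIBRARY_PATH="$STUBS"
  fi
  conda run -n build_binary printenv LD_LIBRARY_PATH  # now prints the stubs path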
2025-05-07T20:26:05.3601217Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:05.3602150Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:07.8083280Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:09.8286925Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:11.8608313Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:11.8609145Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:13.8861211Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:15.7712773Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:15.8342055Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:19.6905643Z /tmp/tmpfly9oh1w: line 3: clang: command not found
2025-05-07T20:26:19.6908034Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:19.7536241Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:19.7556471Z total 36
2025-05-07T20:26:19.7556768Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:25 .
2025-05-07T20:26:19.7557172Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:19.7557623Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:19.7558141Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:19.7558631Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:19.7559103Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:19.7559691Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:19.7560321Z -rw-r--r--. 2 ec2-user ec2-user  2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:26:19.7561035Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:19.7561770Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:19.7580947Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:21.7237856Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:21.7238437Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:22.1516027Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:24.0404176Z -allow-unsupported-compiler
2025-05-07T20:26:24.1026114Z [INFO] Printing out all preprocessor defines in nvcc ...
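[NOTE] Two host-compiler workarounds happen above: the sed call deletes the line in the conda activation hook that pins nvcc's host compiler via -ccbin, and NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" makes nvcc skip its host-compiler version check. The dump that follows is produced by preprocessing an empty CUDA translation unit: -E stops after preprocessing, --compiler-options -dM forwards the host compiler's -dM flag (emit every macro definition instead of the preprocessed source), and `-x cu -` treats empty stdin as CUDA code. A minimal sketch for inspecting a few of those macros, assuming nvcc is on PATH (the grep pattern is illustrative):

  # Dump all predefined and header-derived macros for an empty CUDA source,
  # then filter for the CUDA/C++ version macros that appear in this log.
  conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null \
    | grep -E '__CUDACC_VER_(MAJOR|MINOR)__|__cplusplus'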
2025-05-07T20:26:24.1026657Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:24.1026991Z 2025-05-07T20:26:26.0524714Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:26.0525582Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:26.0526019Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:26.0526339Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:26.0526750Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:26.0527116Z #define _STL_PAIR_H 1 2025-05-07T20:26:26.0527454Z #define __cpp_attributes 200809L 2025-05-07T20:26:26.0527910Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:26.0528311Z #define __DELETE_THROW throw() 2025-05-07T20:26:26.0528573Z #define _PTRDIFF_T_ 2025-05-07T20:26:26.0528809Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:26.0529096Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:26.0529522Z #define _IO_LEFT 02 2025-05-07T20:26:26.0529803Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:26.0530068Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:26.0530342Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:26.0531114Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:26.0531560Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:26.0531839Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:26.0532226Z #define _IOS_OUTPUT 2 2025-05-07T20:26:26.0542771Z #define __SM_100_RT_HPP__ 2025-05-07T20:26:26.0543273Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:26.0543798Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:26.0544263Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:26.0544668Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:26.0545065Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:26.0546159Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:26.0547294Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:26.0547628Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:26.0547937Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:26.0548276Z #define _T_WCHAR_ 2025-05-07T20:26:26.0548513Z #define stdout stdout 2025-05-07T20:26:26.0548884Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:26.0549281Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:26.0549537Z #define __flexarr [] 2025-05-07T20:26:26.0549778Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:26.0550093Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:26.0550438Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:26.0550692Z #define _MATH_H 1 2025-05-07T20:26:26.0550971Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:26.0551317Z #define __S64_TYPE long int 2025-05-07T20:26:26.0551582Z #define __stub_fchflags 2025-05-07T20:26:26.0551843Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:26.0552141Z #define __SQUAD_TYPE long int 2025-05-07T20:26:26.0552412Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:26.0552716Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:26.0553060Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:26.0553324Z #define NL_NMAX INT_MAX 2025-05-07T20:26:26.0553564Z #define _BITS_TIME_H 1 
2025-05-07T20:26:26.0553857Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:26.0554217Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:26.0554527Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:26.0554881Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:26.0555287Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:26.0555660Z #define __CHAR_BIT__ 8 2025-05-07T20:26:26.0555919Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0556243Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:26.0556547Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:26.0556814Z #define FP_NAN 0 2025-05-07T20:26:26.0557091Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:26.0557515Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:26.0557913Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:26.0558197Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:26.0558466Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:26.0558726Z #define __SM_80_RT_H__ 2025-05-07T20:26:26.0558950Z #define _NEW 2025-05-07T20:26:26.0559186Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:26.0559476Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:26.0559843Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:26.0560257Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:26.0560502Z #define __USE_ANSI 1 2025-05-07T20:26:26.0560787Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:26.0561194Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:26.0561564Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:26.0561870Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:26.0562266Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:26.0562559Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:26.0562842Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:26.0563128Z #define PIPE_BUF 4096 2025-05-07T20:26:26.0563453Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:26.0563915Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:26.0564335Z #define ADJ_TICK 0x4000 2025-05-07T20:26:26.0564617Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:26.0564942Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:26.0565203Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:26.0565532Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:26.0566003Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.0566635Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:26.0567002Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:26.0567267Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:26.0567548Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0567829Z #define __cpp_static_assert 201411L 2025-05-07T20:26:26.0568118Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:26.0568388Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:26.0568659Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:26.0568945Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:26.0569252Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:26.0569529Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:26.0569835Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0570198Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:26.0570546Z #define 
__WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:26.0570830Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:26.0571152Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0571515Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:26.0571876Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:26.0572364Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:26.0572665Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:26.0572986Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:26.0573313Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:26.0573722Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:26.0574188Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:26.0574491Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:26.0574759Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:26.0575047Z #define __GCC_IEC_559 2 2025-05-07T20:26:26.0575338Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:26.0575683Z #define _IO_flockfile(_fp) 2025-05-07T20:26:26.0575950Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:26.0576216Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:26.0576478Z #define _IOFBF 0 2025-05-07T20:26:26.0576700Z #define __USE_BSD 1 2025-05-07T20:26:26.0576922Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:26.0577196Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:26.0577476Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:26.0577725Z #define _IO_NO_WRITES 8 2025-05-07T20:26:26.0577983Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:26.0578342Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:26.0578697Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:26.0578999Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:26.0579323Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:26.0579618Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:26.0579882Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:26.0580155Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:26.0580480Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:26.0580866Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:26.0581349Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:26.0581664Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:26.0581969Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:26.0582303Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:26.0582617Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:26.0582925Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:26.0583197Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:26.0583472Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:26.0584119Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:26.0584711Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:26.0585043Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:26.0585470Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:26.0585772Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:26.0586055Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:26.0586328Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:26.0586635Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:26.0586972Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:26.0587277Z #define RAND_MAX 2147483647 2025-05-07T20:26:26.0587547Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:26.0587873Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0588193Z #define __SM_90_RT_H__ 2025-05-07T20:26:26.0588440Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:26.0588698Z #define __COMPAR_FN_T 2025-05-07T20:26:26.0588946Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0589211Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:26.0589687Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:26.0590211Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.0590563Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.0590921Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:26.0591226Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:26.0591566Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:26.0591883Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:26.0592390Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:26.0592939Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:26.0593276Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:26.0593546Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:26.0593851Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:26.0594185Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:26.0594469Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:26.0594740Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:26.0595010Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:26.0595251Z #define __u_char_defined 2025-05-07T20:26:26.0595572Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:26.0595940Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:26.0596197Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:26.0596446Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:26.0596750Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:26.0597193Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:26.0597618Z #define FP_INFINITE 1 2025-05-07T20:26:26.0597993Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.0598418Z #define _IO_pid_t __pid_t 2025-05-07T20:26:26.0598672Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:26.0598934Z #define __LEAF , __leaf__ 2025-05-07T20:26:26.0599180Z #define PATH_MAX 4096 2025-05-07T20:26:26.0599429Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:26.0599767Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:26.0600099Z #define _LIMITS_H___ 2025-05-07T20:26:26.0600323Z #define __size_t 2025-05-07T20:26:26.0600561Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:26.0601203Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:26.0601783Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:26.0602091Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:26.0602424Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:26.0602690Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:26.0603043Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:26.0603447Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:26.0603744Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:26.0604065Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:26.0604355Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:26.0604638Z #define __INT8_C(c) c 2025-05-07T20:26:26.0604992Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:26.0605287Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:26.0605552Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:26.0605819Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:26.0606064Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:26.0606822Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:26.0607156Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0607482Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:26.0607756Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:26.0608033Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:26.0608293Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:26.0608613Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:26.0608919Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:26.0609280Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:26.0609665Z #define NFDBITS __NFDBITS 2025-05-07T20:26:26.0609927Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:26.0610223Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:26.0610542Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:26.0610865Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:26.0611132Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:26.0611417Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:26.0611723Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:26.0612128Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:26.0612545Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:26.0612913Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:26.0613201Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:26.0613515Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:26.0613844Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:26.0614167Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:26.0614506Z #define __daddr_t_defined 2025-05-07T20:26:26.0614761Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.0615040Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:26.0615361Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:26.0615877Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:26.0616372Z #define _ACRTIMP 2025-05-07T20:26:26.0616598Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:26.0616860Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:26.0617154Z #define _IOS_BIN 128 2025-05-07T20:26:26.0617513Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:26.0617930Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0618202Z #define UNDERFLOW 4 2025-05-07T20:26:26.0618422Z #define NAME_MAX 255 
2025-05-07T20:26:26.0618662Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:26.0618930Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:26.0619212Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:26.0619512Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:26.0619888Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:26.0620282Z #define __ptr_t void * 2025-05-07T20:26:26.0620809Z #define M_E 2.7182818284590452354 2025-05-07T20:26:26.0621086Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:26.0621355Z #define __USE_ISOCXX11 1 2025-05-07T20:26:26.0621624Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:26.0621939Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:26.0622237Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:26.0622517Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:26.0622803Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:26.0623124Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:26.0623388Z #define __linux 1 2025-05-07T20:26:26.0623619Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:26.0623888Z #define cudaDeviceMask 0xff 2025-05-07T20:26:26.0624158Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:26.0624594Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:26.0624870Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:26.0625159Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:26.0625476Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:26.0625781Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:26.0626079Z #define _BITS_TYPES_H 1 2025-05-07T20:26:26.0626368Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:26.0626708Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:26.0627014Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:26.0627298Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:26.0627592Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:26.0627878Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:26.0628686Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:26.0629531Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:26.0629813Z #define __unix 1 2025-05-07T20:26:26.0630034Z #define MATH_ERRNO 1 2025-05-07T20:26:26.0630281Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:26.0630565Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:26.0630830Z #define __SM_100_RT_H__ 2025-05-07T20:26:26.0631084Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:26.0631366Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:26.0631659Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0631939Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:26.0632243Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:26.0632708Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:26.0633182Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:26.0633487Z #define CUDARTAPI_CDECL 2025-05-07T20:26:26.0633740Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:26.0634020Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:26.0634311Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:26.0634574Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:26.0634812Z #define __SIZE_T 2025-05-07T20:26:26.0635060Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:26.0635382Z #define 
_GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:26.0635678Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:26.0635940Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:26.0636212Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:26.0636469Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:26.0636863Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:26.0637306Z #define __WAIT_STATUS void * 2025-05-07T20:26:26.0637566Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:26.0637835Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:26.0638107Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:26.0638394Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:26.0638673Z #define __WINT_MIN__ 0U 2025-05-07T20:26:26.0639275Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:26.0640029Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:26.0640340Z #define WUNTRACED 2 2025-05-07T20:26:26.0640575Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:26.0640854Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:26.0641137Z #define NZERO 20 2025-05-07T20:26:26.0641368Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:26.0641650Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:26.0641941Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:26.0642236Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:26.0642496Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.0642779Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:26.0643059Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:26.0643341Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:26.0643618Z #define EXIT_FAILURE 1 2025-05-07T20:26:26.0644003Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:26.0644269Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:26.0644533Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:26.0644791Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:26.0645080Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:26.0645428Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:26.0645788Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:26.0646087Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:26.0646343Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:26.0646612Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:26.0646915Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:26.0647230Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:26.0647521Z #define SEEK_DATA 3 2025-05-07T20:26:26.0647755Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:26.0648059Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:26.0648479Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:26.0649672Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 
2025-05-07T20:26:26.0650391Z 2025-05-07T20:26:26.0650488Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:26.0650750Z #define __INT64_C(c) c ## L 2025-05-07T20:26:26.0651017Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:26.0651358Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:26.0651687Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:26.0652069Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:26.0652401Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:26.0652705Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:26.0652959Z #define __INT_WCHAR_T_H 2025-05-07T20:26:26.0653193Z #define WSTOPPED 2 2025-05-07T20:26:26.0653433Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:26.0653722Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:26.0654002Z #define FP_NORMAL 4 2025-05-07T20:26:26.0654269Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:26.0654557Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:26.0654789Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:26.0655051Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:26.0655336Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:26.0655604Z #define cudaTextureType1D 0x01 2025-05-07T20:26:26.0655879Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:26.0656147Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:26.0656414Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:26.0656712Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:26.0657148Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:26.0657604Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:26.0657864Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:26.0658144Z #define _POSIX_SOURCE 1 2025-05-07T20:26:26.0658402Z #define cudaTextureType2D 0x02 2025-05-07T20:26:26.0658668Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:26.0658943Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:26.0659263Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:26.0659528Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:26.0659955Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:26.0660302Z #define cudaTextureType3D 0x03 2025-05-07T20:26:26.0660573Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:26.0660836Z #define CLOCK_REALTIME 0 2025-05-07T20:26:26.0661089Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:26.0661361Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:26.0661669Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:26.0661952Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:26.0662235Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:26.0675906Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:26.0676256Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:26.0676566Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:26.0677034Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:26.0677305Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:26.0677552Z #define __GLIBC__ 2 2025-05-07T20:26:26.0677759Z #define __END_DECLS } 2025-05-07T20:26:26.0677994Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:26.0678354Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:26.0678721Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:26.0678965Z #define WCONTINUED 8 2025-05-07T20:26:26.0679187Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:26.0679426Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:26.0679692Z #define _ALLOCA_H 1 2025-05-07T20:26:26.0679907Z #define __host__ __location__(host) 
2025-05-07T20:26:26.0680326Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.0680760Z #define __SLONG32_TYPE int 2025-05-07T20:26:26.0681027Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:26.0681310Z #define _SYS_SELECT_H 1 2025-05-07T20:26:26.0681540Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:26.0681780Z #define _IOS_NOCREATE 32 2025-05-07T20:26:26.0682018Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:26.0682284Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:26.0682578Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:26.0682859Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:26.0683137Z #define __global__ __location__(global) 2025-05-07T20:26:26.0683425Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:26.0683686Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:26.0683978Z #define __DBL_DIG__ 15 2025-05-07T20:26:26.0684243Z #define TIME_UTC 1 2025-05-07T20:26:26.0684466Z #define __FLT32_DIG__ 6 2025-05-07T20:26:26.0684802Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:26.0685202Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:26.0685529Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:26.0685853Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:26.0686165Z #define _G_BUFSIZ 8192 2025-05-07T20:26:26.0686476Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:26.0686857Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:26.0687158Z #define __cudaCDP2GetDevice 2025-05-07T20:26:26.0687453Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:26.0687749Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:26.0687986Z #define __GXX_WEAK__ 1 2025-05-07T20:26:26.0688241Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0688547Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:26.0688802Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:26.0689102Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:26.0689445Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:26.0689725Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:26.0690006Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:26.0690307Z #define _G_config_h 1 2025-05-07T20:26:26.0690593Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:26.0690932Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:26.0691215Z #define _GCC_WCHAR_T 2025-05-07T20:26:26.0691450Z #define TMP_MAX 238328 2025-05-07T20:26:26.0691685Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:26.0692252Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:26.0692520Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.0692794Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:26.0693074Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:26.0693367Z #define _IO_SKIPWS 01 2025-05-07T20:26:26.0693770Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:26.0694277Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:26.0694564Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:26.0694906Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:26.0695273Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:26.0695648Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:26.0696103Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.0696353Z #define le32toh(x) (x) 2025-05-07T20:26:26.0696599Z #define _SIZE_T_DEFINED 2025-05-07T20:26:26.0696851Z 
#define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:26.0697195Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:26.0697551Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:26.0697956Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:26.0698380Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:26.0698642Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:26.0698909Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:26.0699178Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:26.0699456Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:26.0699995Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:26.0700505Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:26.0700813Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:26.0701174Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:26.0701491Z #define _WCHAR_T_ 2025-05-07T20:26:26.0701712Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:26.0702086Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:26.0702479Z #define RTSIG_MAX 32 2025-05-07T20:26:26.0702702Z #define _STDDEF_H 2025-05-07T20:26:26.0702928Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:26.0703203Z #define _VA_LIST_DEFINED 2025-05-07T20:26:26.0703461Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:26.0703794Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:26.0704240Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:26.0704575Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:26.0704862Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:26.0705332Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:26.0705879Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:26.0706552Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:26.0706982Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:26.0707409Z #define __unix__ 1 2025-05-07T20:26:26.0707726Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0708015Z #define __INT_WIDTH__ 32 2025-05-07T20:26:26.0708264Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:26.0708503Z #define _IONBF 2 2025-05-07T20:26:26.0708948Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:26.0709726Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:26.0710275Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:26.0710536Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:26.0710802Z #define __UINT16_C(c) c 2025-05-07T20:26:26.0711047Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:26.0711332Z #define STA_DEL 0x0020 2025-05-07T20:26:26.0711571Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:26.0711830Z #define __id_t_defined 2025-05-07T20:26:26.0712381Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:26.0712837Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:26.0713273Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:26.0713546Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:26.0713801Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:26.0714094Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:26.0714378Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:26.0714652Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:26.0714914Z #define SING 2 2025-05-07T20:26:26.0715134Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:26.0715409Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0715708Z #define cudaStreamDefault 0x00 2025-05-07T20:26:26.0716062Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:26.0716628Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:26.0716895Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:26.0717165Z #define __gnu_linux__ 1 2025-05-07T20:26:26.0717412Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:26.0717668Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:26.0717967Z #define MAX_INPUT 255 2025-05-07T20:26:26.0718213Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:26.0718539Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:26.0718913Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:26.0719234Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:26.0719502Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:26.0719895Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:26.0720325Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:26.0720656Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:26.0721012Z #define _Mfloat_ float 2025-05-07T20:26:26.0721288Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:26.0721606Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.0721892Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:26.0722228Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:26.0722775Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:26.0723275Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0723547Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:26.0723878Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:26.0724279Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:26.0724584Z #define __USE_ISOC11 1 2025-05-07T20:26:26.0724822Z #define _BSD_SIZE_T_ 2025-05-07T20:26:26.0725061Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:26.0725307Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:26.0725574Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:26.0725882Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:26.0726198Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:26.0726514Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:26.0726853Z #define __THROW throw () 2025-05-07T20:26:26.0727111Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:26.0727400Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0727760Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.0728120Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:26.0728399Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:26.0728665Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:26.0728936Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:26.0729194Z #define L_tmpnam 20 2025-05-07T20:26:26.0729428Z #define ___int_wchar_t_h 2025-05-07T20:26:26.0729773Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:26.0730153Z #define isascii(c) __isascii (c) 2025-05-07T20:26:26.0730415Z #define _T_PTRDIFF 2025-05-07T20:26:26.0730730Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:26.0731082Z #define toascii(c) __toascii (c) 2025-05-07T20:26:26.0731344Z #define __GNUC__ 11 2025-05-07T20:26:26.0731695Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:26.0732135Z #define __GXX_RTTI 1 2025-05-07T20:26:26.0732385Z #define __pie__ 2 2025-05-07T20:26:26.0732598Z #define __MMX__ 1 2025-05-07T20:26:26.0732824Z #define __cudaCDP2Malloc 2025-05-07T20:26:26.0733076Z #define __timespec_defined 1 2025-05-07T20:26:26.0733328Z #define L_ctermid 9 2025-05-07T20:26:26.0733564Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:26.0733864Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:26.0734262Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:26.0734640Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:26.0734902Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:26.0735195Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:26.0735603Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:26.0735920Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:26.0736204Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:26.0736653Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:26.0737411Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:26.0738020Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:26.0738327Z #define __USE_SVID 1 2025-05-07T20:26:26.0738584Z #define __constant__ __location__(constant) 2025-05-07T20:26:26.0738895Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:26.0739197Z #define __device__ __location__(device) 2025-05-07T20:26:26.0739531Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:26.0739856Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:26.0740127Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:26.0740420Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:26.0740767Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:26.0741142Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:26.0741437Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:26.0741811Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:26.0742191Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:26.0742444Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:26.0742814Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:26.0743237Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:26.0743555Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:26.0743851Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:26.0744145Z #define NGROUPS_MAX 65536 2025-05-07T20:26:26.0744402Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:26.0744669Z #define __USE_ISOC95 1 2025-05-07T20:26:26.0744891Z #define _TIME_H 1 2025-05-07T20:26:26.0745160Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:26.0745491Z #define __USE_ISOC99 1 2025-05-07T20:26:26.0745821Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:26.0746191Z #define HOST_NAME_MAX 64 2025-05-07T20:26:26.0746448Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:26.0746713Z #define _IOS_ATEND 4 2025-05-07T20:26:26.0746950Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:26.0747280Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.0747689Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.0748033Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:26.0748319Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:26.0748646Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:26.0748959Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:26.0749218Z #define _STDIO_H 1 2025-05-07T20:26:26.0749619Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:26.0750100Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:26.0750463Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:26.0750932Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:26.0751232Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:26.0751497Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:26.0751773Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:26.0752069Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:26.0752370Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0752691Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:26.0752969Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.0753247Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:26.0753557Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:26.0753837Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:26.0754154Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:26.0754530Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:26.0754990Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:26.0755239Z #define __USE_XOPEN 1 2025-05-07T20:26:26.0755480Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:26.0755935Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:26.0756383Z #define __USE_XOPEN2K 1 2025-05-07T20:26:26.0756624Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:26.0756897Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:26.0757198Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:26.0757469Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:26.0757996Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.0758528Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.0758812Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:26.0759175Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:26.0759566Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:26.0759955Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:26.0760349Z #define __END_NAMESPACE_C99 2025-05-07T20:26:26.0760626Z #define __glibcxx_integral_traps true 2025-05-07T20:26:26.0760923Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:26.0761177Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:26.0761436Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:26.0761703Z #define _IOS_TRUNC 16 2025-05-07T20:26:26.0761932Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:26.0762184Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:26.0762480Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:26.0762777Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:26.0763147Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:26.0763544Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:26.0763824Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:26.0764086Z #define _IO_UNITBUF 020000 2025-05-07T20:26:26.0764340Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:26.0764607Z #define __FD_SETSIZE 1024 2025-05-07T20:26:26.0764859Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:26.0765133Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:26.0765480Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:26.0765834Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:26.0766105Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:26.0766418Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:26.0766736Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:26.0767013Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:26.0767326Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:26.0767671Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:26.0767957Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:26.0768291Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:26.0768585Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:26.0768854Z #define __USE_POSIX199506 1 2025-05-07T20:26:26.0769109Z #define _FEATURES_H 1 2025-05-07T20:26:26.0769356Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:26.0769749Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:26.0770333Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:26.0770672Z #define 
__stub_getmsg 2025-05-07T20:26:26.0770901Z #define _IO_FIXED 010000 2025-05-07T20:26:26.0771177Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:26.0771498Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:26.0771777Z #define __stub_setlogin 2025-05-07T20:26:26.0772170Z #define __stub_fattach 2025-05-07T20:26:26.0772420Z #define __cplusplus 201703L 2025-05-07T20:26:26.0772693Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:26.0772973Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:26.0773234Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:26.0773516Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:26.0774049Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:26.0774685Z #define _IO_INTERNAL 010 2025-05-07T20:26:26.0774934Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:26.0775268Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.0775634Z #define __dev_t_defined 2025-05-07T20:26:26.0775876Z #define __DEPRECATED 1 2025-05-07T20:26:26.0776103Z #define __S32_TYPE int 2025-05-07T20:26:26.0776360Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:26.0776661Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:26.0776922Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:26.0777176Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:26.0777787Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:26.0778431Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:26.0778741Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:26.0779090Z #define OVERFLOW 3 2025-05-07T20:26:26.0779337Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:26.0779650Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:26.0779936Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0780279Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:26.0780613Z #define __SSE2_MATH__ 1 2025-05-07T20:26:26.0780863Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:26.0781178Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0781486Z #define _IO_STDIO_H 2025-05-07T20:26:26.0781731Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:26.0782026Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:26.0782350Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:26.0782649Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0782967Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:26.0783237Z #define __amd64 1 2025-05-07T20:26:26.0783460Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:26.0783732Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:26.0784017Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:26.0784358Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:26.0784670Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:26.0784941Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:26.0785245Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:26.0785518Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:26.0785778Z #define __bounded 2025-05-07T20:26:26.0786009Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:26.0786280Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0786576Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:26.0786866Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:26.0787131Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:26.0787411Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0787738Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:26.0788154Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:26.0788565Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:26.0788840Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:26.0789183Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:26.0789535Z #define STA_PLL 0x0001 2025-05-07T20:26:26.0789784Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:26.0790150Z #define __GNUG__ 11 2025-05-07T20:26:26.0790391Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:26.0790660Z #define _T_WCHAR 2025-05-07T20:26:26.0790901Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:26.0791188Z #define __specialization_static 2025-05-07T20:26:26.0791497Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:26.0791819Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:26.0792079Z #define cudaArraySparse 0x40 2025-05-07T20:26:26.0792354Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:26.0792640Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:26.0792940Z #define _WCHAR_T 2025-05-07T20:26:26.0793168Z #define __cudaCDP2Free 2025-05-07T20:26:26.0793824Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:26.0794684Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:26.0795113Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:26.0795561Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:26.0795844Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:26.0796108Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:26.0796449Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.0796806Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:26.0797048Z #define __NO_CTYPE 1 2025-05-07T20:26:26.0797279Z #define __stub_bdflush 2025-05-07T20:26:26.0797647Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:26.0806041Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:26.0806732Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:26.0807042Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:26.0807323Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:26.0807638Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:26.0807939Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:26.0808297Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:26.0808643Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:26.0808934Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:26.0809212Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:26.0809564Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:26.0809919Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:26.0810206Z #define _IO_STDIO 040000 2025-05-07T20:26:26.0810538Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:26.0810933Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:26.0811258Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:26.0811549Z #define _PTRDIFF_T 2025-05-07T20:26:26.0811778Z #define _MOVE_H 1 2025-05-07T20:26:26.0812106Z #define __cpp_hex_float 201603L 2025-05-07T20:26:26.0812368Z #define ADJ_TAI 0x0080 2025-05-07T20:26:26.0812603Z #define __ptrvalue 2025-05-07T20:26:26.0812836Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:26.0813090Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:26.0813382Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:26.0813692Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:26.0813944Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:26.0814240Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:26.0814646Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:26.0815027Z #define __USE_GNU 1 2025-05-07T20:26:26.0815262Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:26.0815542Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:26.0815816Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:26.0816201Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:26.0816598Z #define WEXITED 4 2025-05-07T20:26:26.0816815Z #define _IO_NO_READS 4 2025-05-07T20:26:26.0817112Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:26.0817464Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:26.0818060Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:26.0818361Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:26.0818681Z #define __uid_t_defined 2025-05-07T20:26:26.0818934Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:26.0819216Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:26.0819491Z #define WNOHANG 1 2025-05-07T20:26:26.0819740Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:26.0820048Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:26.0820326Z #define cudaEventDefault 0x00 2025-05-07T20:26:26.0820632Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:26.0820958Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:26.0821194Z #define __x86_64 1 2025-05-07T20:26:26.0821433Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:26.0821991Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:26.0822472Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:26.0822987Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.0823429Z #define __PTRDIFF_T 2025-05-07T20:26:26.0823753Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:26.0824141Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:26.0824427Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0824723Z #define _Mlong_double_ long double 2025-05-07T20:26:26.0825003Z #define __cpp_lambdas 200907L 2025-05-07T20:26:26.0825266Z #define _IO_DEC 020 2025-05-07T20:26:26.0825503Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:26.0825773Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:26.0826069Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:26.0826359Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:26.0826626Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:26.0826931Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:26.0827271Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:26.0827549Z #define _ANSI_STDDEF_H 2025-05-07T20:26:26.0827837Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:26.0828159Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:26.0828528Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:26.0828920Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:26.0829210Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:26.0829509Z #define __cpp_template_auto 201606L 2025-05-07T20:26:26.0829866Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:26.0830243Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:26.0830519Z #define __key_t_defined 2025-05-07T20:26:26.0830768Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:26.0831145Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:26.0831631Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:26.0831999Z #define __GNUC_VA_LIST 2025-05-07T20:26:26.0832347Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:26.0832740Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:26.0833013Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:26.0833294Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:26.0833594Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:26.0833850Z #define __WCOREFLAG 0x80 2025-05-07T20:26:26.0834103Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:26.0834413Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:26.0834696Z #define __LP64__ 1 2025-05-07T20:26:26.0834943Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:26.0835270Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:26.0835559Z #define _IO_off64_t __off64_t 2025-05-07T20:26:26.0835822Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0836093Z #define __time_t_defined 1 2025-05-07T20:26:26.0836354Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:26.0836800Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:26.0837177Z #define __USE_UNIX98 1 2025-05-07T20:26:26.0837425Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0837702Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:26.0837971Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:26.0838273Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:26.0838591Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:26.0838847Z #define SEEK_CUR 1 2025-05-07T20:26:26.0839086Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.0839361Z #define _ASSERT_H 1 2025-05-07T20:26:26.0839939Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:26.0840586Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:26.0840957Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:26.0841217Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:26.0841480Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:26.0841756Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:26.0842145Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.0842558Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:26.0843234Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:26.0843926Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:26.0844259Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:26.0844611Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:26.0844994Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:26.0845269Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.0845552Z #define cudaArrayDefault 0x00 2025-05-07T20:26:26.0845843Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:26.0846138Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:26.0846418Z #define TLOSS 5 2025-05-07T20:26:26.0846648Z #define __ssize_t_defined 2025-05-07T20:26:26.0846905Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:26.0847177Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:26.0847475Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:26.0847763Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:26.0848044Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:26.0848336Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:26.0848650Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:26.0848946Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:26.0849235Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:26.0849529Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:26.0849792Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:26.0850126Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:26.0850497Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:26.0850742Z #define __cdecl 2025-05-07T20:26:26.0850979Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:26.0851321Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:26.0851654Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:26.0851909Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:26.0852318Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:26.0852617Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:26.0852884Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.0853202Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:26.0853543Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:26.0853963Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:26.0854453Z #define ADJ_NANO 0x2000 2025-05-07T20:26:26.0854767Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:26.0855133Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:26.0855426Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:26.0855698Z #define __FLT_DIG__ 6 2025-05-07T20:26:26.0856147Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:26.0856553Z #define __NO_INLINE__ 1 2025-05-07T20:26:26.0856865Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.0857223Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:26.0857490Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:26.0857754Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:26.0858050Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:26.0858329Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:26.0858628Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:26.0858928Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:26.0859317Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:26.0859738Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:26.0860092Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:26.0860535Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:26.0860778Z #define MAX_CANON 255 2025-05-07T20:26:26.0861012Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:26.0861275Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:26.0861543Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:26.0861835Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:26.0862146Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:26.0862451Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:26.0862728Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:26.0863058Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:26.0863374Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:26.0863633Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:26.0863934Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:26.0864230Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:26.0864508Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:26.0864836Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:26.0865137Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:26.0865396Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:26.0865654Z #define _SYS_TYPES_H 1 2025-05-07T20:26:26.0865905Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:26.0866166Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:26.0866441Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:26.0866684Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:26.0866961Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:26.0867259Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:26.0867519Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:26.0867820Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:26.0868090Z #define FP_SUBNORMAL 3 2025-05-07T20:26:26.0868347Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:26.0868637Z #define _INITIALIZER_LIST 2025-05-07T20:26:26.0868887Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:26.0869154Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:26.0869456Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:26.0869717Z #define _IO_file_flags _flags 2025-05-07T20:26:26.0869982Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:26.0870239Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:26.0870521Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:26.0870802Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:26.0871074Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:26.0871448Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:26.0871849Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:26.0872167Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:26.0872441Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:26.0872700Z #define _BSD_SOURCE 1 2025-05-07T20:26:26.0872939Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:26.0873795Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:26.0874669Z #define __catch(X) catch(X) 2025-05-07T20:26:26.0874934Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:26.0875229Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.0875640Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:26.0875902Z #define __STRING(x) #x 2025-05-07T20:26:26.0876149Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:26.0876427Z #define _T_PTRDIFF_ 2025-05-07T20:26:26.0876669Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:26.0876979Z 
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:26.0877260Z #define __unbounded 2025-05-07T20:26:26.0877501Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0877794Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:26.0878077Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0878378Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:26.0878660Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:26.0878961Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:26.0879377Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:26.0879691Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:26.0879980Z #define __managed__ __location__(managed) 2025-05-07T20:26:26.0880282Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:26.0880686Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.0881116Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:26.0881381Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:26.0881758Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:26.0882171Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:26.0882429Z #define _SYS_SIZE_T_H 2025-05-07T20:26:26.0882718Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:26.0883062Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:26.0883347Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:26.0883639Z #define _CRTIMP 2025-05-07T20:26:26.0883867Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:26.0884189Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:26.0884515Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:26.0884882Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:26.0885305Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0885630Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:26.0885909Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:26.0886202Z #define __SIZE_T__ 2025-05-07T20:26:26.0886422Z #define __stub_gtty 2025-05-07T20:26:26.0886649Z #define __pid_t_defined 2025-05-07T20:26:26.0886916Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:26.0887226Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0887538Z #define __glibcxx_function_requires(...) 
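The dump above records glibc's <endian.h> conversion macros for this little-endian target: htobe64 expands to __bswap_64, while htole32/htole16/le64toh expand to the identity. A minimal C sketch of the behaviour those expansions imply, assuming glibc on a little-endian x86-64 host (the example value 0x11223344 is illustrative only):

    #include <endian.h>   /* htobe32/htole32 etc., as captured in the dump above */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t x = 0x11223344u;
        /* htole32 is a no-op here, per "#define htole32(x) (x)" in the dump,  */
        /* while the big-endian conversions reverse all four bytes via bswap. */
        printf("htole32(0x11223344) = 0x%08x\n", (unsigned) htole32(x)); /* 0x11223344 */
        printf("htobe32(0x11223344) = 0x%08x\n", (unsigned) htobe32(x)); /* 0x44332211 */
        return 0;
    }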
2025-05-07T20:26:26.0887884Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:26.0888205Z #define __need_clockid_t 2025-05-07T20:26:26.0888448Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:26.0888717Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:26.0889041Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:26.0889364Z #define _IO_HEX 0100 2025-05-07T20:26:26.0889623Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:26.0889964Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:26.0890063Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:26.0890177Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:26.0890402Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.0890520Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:26.0890633Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:26.0890733Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:26.0890839Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:26.0890949Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:26.0891033Z #define __stub_sstk 2025-05-07T20:26:26.0891126Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:26.0891289Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:26.0891377Z #define __wur 2025-05-07T20:26:26.0891500Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:26.0891587Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:26.0891766Z #define _IO_OCT 040 2025-05-07T20:26:26.0891866Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:26.0892068Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:26.0892160Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:26.0892295Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:26.0892387Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:26.0892489Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:26.0892686Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:26.0892781Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:26.0892878Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:26.0892987Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:26.0893075Z #define __off64_t_defined 2025-05-07T20:26:26.0893184Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:26.0893360Z #define __FLT128_DIG__ 33 2025-05-07T20:26:26.0893466Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:26.0893569Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:26.0893660Z #define __INT32_C(c) c 2025-05-07T20:26:26.0893755Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:26.0893873Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:26.0893979Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:26.0894089Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:26.0894185Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:26.0894280Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:26.0894419Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:26.0894513Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:26.0894602Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:26.0894709Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:26.0894805Z #define __have_pthread_attr_t 1 2025-05-07T20:26:26.0894904Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:26.0895142Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:26.0895250Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:26.0895353Z #define __cudaCDP2EventRecord 2025-05-07T20:26:26.0895460Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:26.0895545Z #define 
htole32(x) (x) 2025-05-07T20:26:26.0895800Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:26.0895930Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:26.0896030Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:26.0896193Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.0896333Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:26.0896458Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:26.0896605Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:26.0896695Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:26.0896797Z #define cudaArrayLayered 0x01 2025-05-07T20:26:26.0896981Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:26.0897091Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:26.0897185Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:26.0897298Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:26.0897381Z #define unix 1 2025-05-07T20:26:26.0897484Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:26.0897576Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:26.0897670Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:26.0897793Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:26.0897879Z #define __USE_POSIX 1 2025-05-07T20:26:26.0897974Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:26.0898112Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:26.0898204Z #define __THROWNL throw () 2025-05-07T20:26:26.0898295Z #define __cpp_rtti 199711L 2025-05-07T20:26:26.0898404Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:26.0898492Z #define __PMT(args) args 2025-05-07T20:26:26.0898644Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0898848Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:26.0898965Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:26.0899061Z #define _SIZE_T_DECLARED 2025-05-07T20:26:26.0899250Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:26.0899343Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:26.0899753Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:26.0899851Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:26.0899944Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:26.0900044Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:26.0900186Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:26.0900270Z #define _WCHAR_T_H 2025-05-07T20:26:26.0900364Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:26.0900453Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:26.0900547Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:26.0900733Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:26.0900829Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:26.0900921Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:26.0901029Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:26.0901115Z #define __ELF__ 1 2025-05-07T20:26:26.0901220Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:26.0901318Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:26.0901408Z #define STA_INS 0x0010 2025-05-07T20:26:26.0901521Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:26.0901693Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:26.0901787Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:26.0901887Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0901998Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
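The wait-status helpers captured in this stretch of the dump (__WEXITSTATUS, __WTERMSIG, __W_STOPCODE, __WIFEXITED, WSTOPSIG) imply a bit layout: an exit code lives in bits 8-15, a terminating signal in the low 7 bits, and a stop is encoded as (sig << 8) | 0x7f. A minimal C sketch of that layout, assuming glibc's <sys/wait.h>; the status values are hand-built for illustration, not taken from a real child process:

    #include <stdio.h>
    #include <sys/wait.h>

    int main(void) {
        int exited  = 42 << 8;           /* as if a child exited with code 42     */
        int stopped = (19 << 8) | 0x7f;  /* __W_STOPCODE(19): stopped by SIGSTOP  */
        /* __WIFEXITED(status) is __WTERMSIG(status) == 0, so exit codes pass. */
        printf("%d %d\n", WIFEXITED(exited),   WEXITSTATUS(exited)); /* 1 42 */
        printf("%d %d\n", WIFSTOPPED(stopped), WSTOPSIG(stopped));   /* 1 19 */
        return 0;
    }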
2025-05-07T20:26:26.0902112Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0902209Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:26.0902313Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:26.0902416Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:26.0902578Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.0902739Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:26.0902849Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:26.0903179Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:26.0903317Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:26.0903411Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:26.0903497Z #define __FLT_RADIX__ 2 2025-05-07T20:26:26.0903603Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:26.0903772Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:26.0903866Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:26.0903965Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:26.0904086Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:26.0904190Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:26.0904313Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:26.0904420Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:26.0904503Z #define WORD_BIT 32 2025-05-07T20:26:26.0904596Z #define _IO_USER_BUF 1 2025-05-07T20:26:26.0904692Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:26.0904803Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0904912Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:26.0905010Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:26.0905117Z #define __long_double_t long double 2025-05-07T20:26:26.0905212Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:26.0905304Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:26.0905717Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:26.0905801Z #define __k8 1 2025-05-07T20:26:26.0905998Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:26.0906563Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:26.0906703Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:26.0906811Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:26.0906910Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:26.0907248Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:26.0907353Z #define __blksize_t_defined 2025-05-07T20:26:26.0907446Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:26.0907545Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:26.0907669Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:26.0907762Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:26.0907869Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:26.0907972Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:26.0908066Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:26.0908324Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:26.0908681Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.0908782Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:26.0909024Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:26.0909108Z #define SEEK_SET 0 2025-05-07T20:26:26.0909206Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:26.0909316Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:26.0909512Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:26.0909615Z #define __cudaCDP2GetLastError 2025-05-07T20:26:26.0909716Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:26.0909806Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:26.0910131Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:26.0910239Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:26.0910336Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:26.0910433Z #define __stub_sigreturn 2025-05-07T20:26:26.0910675Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:26.0910780Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:26.0910878Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:26.0910975Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:26.0911060Z #define CLOCK_TAI 11 2025-05-07T20:26:26.0911179Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:26.0911391Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:26.0911478Z #define __restrict_arr 2025-05-07T20:26:26.0911595Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:26.0911736Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:26.0912282Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:26.0912469Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:26.0912555Z #define __USE_MISC 1 2025-05-07T20:26:26.0912667Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:26.0912769Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:26.0912857Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:26.0912950Z #define __LDBL_DIG__ 18 2025-05-07T20:26:26.0913050Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:26.0913159Z #define __malloc_and_calloc_defined 2025-05-07T20:26:26.0913250Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:26.0913351Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:26.0913439Z #define __x86_64__ 1 2025-05-07T20:26:26.0913520Z #define _SIZE_T_ 2025-05-07T20:26:26.0914430Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:26.0914538Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:26.0914636Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:26.0914758Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:26.0914881Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:26.0915107Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:26.0915224Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:26.0915346Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:26.0915486Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:26.0915589Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:26.0916066Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:26.0916198Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:26.0916347Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:26.0916449Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:26.0916551Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:26.0916719Z #define STA_FLL 0x0008 2025-05-07T20:26:26.0916862Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:26.0916965Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:26.0917092Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0917204Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:26.0917293Z #define __stub_revoke 2025-05-07T20:26:26.0917383Z #define __timer_t_defined 1 2025-05-07T20:26:26.0917516Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:26.0917612Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:26.0917717Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:26.0917828Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:26.0917922Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:26.0918024Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:26.0918138Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:26.0918238Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:26.0918393Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:26.0918493Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:26.0918582Z #define _IO_off_t __off_t 2025-05-07T20:26:26.0918668Z #define __FLT64_DIG__ 15 2025-05-07T20:26:26.0918903Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:26.0918999Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:26.0919134Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0919258Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:26.0919354Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:26.0919461Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:26.0919544Z #define NULL __null 2025-05-07T20:26:26.0919700Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:26.0919810Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:26.0919912Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:26.0920008Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0931729Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:26.0931869Z #define FP_ZERO 2 2025-05-07T20:26:26.0932085Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:26.0932248Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:26.0932364Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0932458Z #define __WCHAR_T__ 2025-05-07T20:26:26.0932556Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:26.0932763Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:26.0932914Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:26.0933012Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:26.0933139Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:26.0933254Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.0933384Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:26.0933517Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:26.0933608Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:26.0933702Z #define _SIGSET_H_types 1 2025-05-07T20:26:26.0933820Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:26.0933925Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:26.0934238Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:26.0934344Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:26.0934465Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:26.0934604Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:26.0934712Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:26.0934841Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:26.0934960Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:26.0935135Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:26.0935231Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:26.0935341Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:26.0935443Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:26.0935538Z #define STA_MODE 0x4000 2025-05-07T20:26:26.0935738Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:26.0935840Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:26.0935960Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:26.0936071Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:26.0936166Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:26.0936277Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:26.0936372Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:26.0936486Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:26.0936582Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:26.0936698Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0936780Z #define __SEG_FS 1 2025-05-07T20:26:26.0936877Z #define _IO_size_t size_t 2025-05-07T20:26:26.0936974Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:26.0937077Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:26.0937163Z #define __stub_lchmod 2025-05-07T20:26:26.0937255Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:26.0937377Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0937473Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:26.0937556Z #define __SEG_GS 1 2025-05-07T20:26:26.0937753Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:26.0937842Z #define _IOS_APPEND 8 2025-05-07T20:26:26.0937936Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:26.0938036Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:26.0938133Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:26.0938230Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:26.0938336Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:26.0938422Z #define htole16(x) (x) 2025-05-07T20:26:26.0938537Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:26.0938631Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:26.0938725Z #define __INT16_TYPE__ short int 2025-05-07T20:26:26.0938833Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:26.0938941Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:26.0939056Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:26.0939190Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:26.0939281Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:26.0939376Z #define __WCLONE 0x80000000 2025-05-07T20:26:26.0939476Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:26.0939560Z #define SEEK_HOLE 4 2025-05-07T20:26:26.0939654Z #define TIMER_ABSTIME 1 2025-05-07T20:26:26.0939748Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:26.0939840Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:26.0940022Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:26.0940135Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0940230Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:26.0940344Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:26.0940441Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0940562Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:26.0940663Z #define _LINUX_LIMITS_H 2025-05-07T20:26:26.0940749Z #define linux 1 2025-05-07T20:26:26.0940844Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:26.0940963Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:26.0941146Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:26.0941250Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:26.0941357Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:26.0941503Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:26.0941606Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:26.0941703Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0941801Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:26.0941896Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:26.0941988Z #define htole64(x) (x) 2025-05-07T20:26:26.0942088Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:26.0942220Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:26.0942314Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:26.0942822Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:26.0942992Z #define __USE_POSIX2 1 2025-05-07T20:26:26.0943090Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:26.0943192Z #define __WALL 0x40000000 2025-05-07T20:26:26.0943288Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:26.0943373Z #define _XLOCALE_H 1 2025-05-07T20:26:26.0943473Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:26.0943568Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:26.0943662Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:26.0943771Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:26.0943859Z #define __EXCEPTIONS 1 2025-05-07T20:26:26.0943957Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:26.0944159Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:26.0944247Z #define __WORDSIZE 64 2025-05-07T20:26:26.0944345Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:26.0944434Z #define _STL_RELOPS_H 1 2025-05-07T20:26:26.0944531Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:26.0944638Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:26.0944733Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:26.0944824Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:26.0944934Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:26.0945238Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:26.0945473Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:26.0945612Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:26.0945709Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:26.0945819Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:26.0945931Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:26.0946030Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:26.0946143Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:26.0946326Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:26.0946429Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:26.0946528Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:26.0946631Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:26.0946814Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:26.0946936Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:26.0947020Z #define _STRING_H 1 2025-05-07T20:26:26.0947120Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:26.0947214Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:26.0947310Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:26.0947449Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:26.0947542Z #define __code_model_small__ 1 2025-05-07T20:26:26.0947630Z #define _PSTL_CONFIG_H 2025-05-07T20:26:26.0947736Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:26.0947848Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:26.0947943Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:26.0948051Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:26.0948404Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:26.0948498Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:26.0948706Z #define le64toh(x) (x) 2025-05-07T20:26:26.0948797Z #define FILENAME_MAX 4096 2025-05-07T20:26:26.0948955Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:26.0949069Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:26.0949153Z #define L_cuserid 9 2025-05-07T20:26:26.0949244Z #define __ino_t_defined 2025-05-07T20:26:26.0949324Z #define __k8__ 1 2025-05-07T20:26:26.0949420Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:26.0949534Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:26.0949626Z #define __int8_t_defined 2025-05-07T20:26:26.0949718Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:26.0949823Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:26.0949936Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:26.0950222Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:26.0950347Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:26.0950496Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:26.0950594Z #define __HAVE_COLUMN 2025-05-07T20:26:26.0950681Z #define __stub_fdetach 2025-05-07T20:26:26.0951102Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:26.0951190Z #define __pic__ 2 2025-05-07T20:26:26.0951309Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0951404Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:26.0951502Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:26.0951602Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:26.0951688Z #define __stub_chflags 2025-05-07T20:26:26.0951781Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:26.0951865Z #define __need_IOV_MAX 2025-05-07T20:26:26.0951978Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:26.0952086Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:26.0952182Z #define __cpp_decltype 200707L 2025-05-07T20:26:26.0952285Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:26.0952380Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:26.0952487Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:26.0952578Z #define TTY_NAME_MAX 32 2025-05-07T20:26:26.0952747Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:26.0952869Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0953043Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:26.0953152Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:26.0953250Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:26.0953343Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:26.0953424Z #define __import__ 2025-05-07T20:26:26.0953517Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:26.0953652Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:26.0953744Z #define __export__ 2025-05-07T20:26:26.0953889Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:26.0954007Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:26.0954181Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:26.0954283Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:26.0954372Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:26.0954466Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:26.0954563Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:26.0954682Z #define 
2025-05-07T20:26:26.0954807Z [predefined-macro dump truncated: several thousand #define lines covering glibc, libstdc++ (_GLIBCXX_*), PSTL (_PSTL_*), and the CUDA headers, including __NVCC__ 1, __CUDACC__ 1, __CUDA_ARCH_LIST__ 520, and CUDART_VERSION 12080]
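[NOTE] The dump above is standard preprocessor output; the exact command used by setup_env.bash is not shown in this excerpt. A minimal sketch that produces a comparable dump, assuming any GCC-compatible host compiler (CUDA-specific macros such as __CUDACC__ and __CUDA_ARCH_LIST__ appear only when the source is preprocessed through nvcc):

    # Print every macro the host C++ compiler predefines for an empty
    # translation unit; -dM emits #define lines instead of preprocessed code.
    echo | gcc -dM -E -x c++ - | sort | less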
2025-05-07T20:26:26.1215168Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:26.1215184Z 
2025-05-07T20:26:28.0072181Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:28.0072810Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:26:28.0073497Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:26:28.0074179Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:26:28.0074811Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:26:28.0075175Z 
2025-05-07T20:26:28.0721674Z /usr/bin/nvidia-smi
2025-05-07T20:26:28.0726716Z + nvidia-smi
2025-05-07T20:26:28.0726912Z 
2025-05-07T20:26:28.0904158Z Wed May  7 20:26:28 2025
2025-05-07T20:26:28.0904643Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.0905324Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:28.0905911Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.0907495Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:28.0908364Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:28.0908960Z |                                         |                        |               MIG M. |
2025-05-07T20:26:28.0909697Z |=========================================+========================+======================|
2025-05-07T20:26:28.1071580Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:28.1072156Z |  0%   28C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:26:28.1072588Z |                                         |                        |                  N/A |
2025-05-07T20:26:28.1073185Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.1075591Z 
2025-05-07T20:26:28.1082698Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.1083598Z | Processes:                                                                              |
2025-05-07T20:26:28.1084098Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:28.1084518Z |        ID   ID                                                               Usage      |
2025-05-07T20:26:28.1084864Z |=========================================================================================|
2025-05-07T20:26:28.1085308Z |  No running processes found                                                             |
2025-05-07T20:26:28.1085790Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.3819881Z 
2025-05-07T20:26:28.3824473Z [INSTALL] Successfully installed CUDA 12.8.0
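[NOTE] A minimal sketch for cross-checking the installed toolkit against the driver (not part of the workflow; the parsing patterns below are assumptions based on the output shown above):

    # Toolkit release reported by nvcc, e.g. "12.8".
    toolkit=$(conda run -n build_binary nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
    # Highest CUDA version the driver supports, from the nvidia-smi banner.
    driver=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
    echo "toolkit=${toolkit} driver_supports=${driver}"
    # The driver value is an upper bound: a 12.8 toolkit needs a driver
    # advertising CUDA Version >= 12.8, which is the case here.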
2025-05-07T20:26:28.3872026Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:28.3872792Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:28.3884619Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:28.3884966Z env:
2025-05-07T20:26:28.3885189Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:28.3885485Z   BUILD_ENV: build_binary
2025-05-07T20:26:28.3885728Z   BUILD_TARGET: genai
2025-05-07T20:26:28.3885958Z   BUILD_VARIANT: cuda
2025-05-07T20:26:28.3886183Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:26:28.3886437Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:28.3886738Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:28.3887068Z ##[endgroup]
2025-05-07T20:26:28.7270328Z ################################################################################
2025-05-07T20:26:28.7270701Z # Install PyTorch (PIP)
2025-05-07T20:26:28.7270935Z #
2025-05-07T20:26:28.7285997Z # [2025-05-07T20:26:28.728Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:26:28.7286720Z ################################################################################
2025-05-07T20:26:28.7287091Z 
2025-05-07T20:26:28.7314137Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.7094854Z Channels:
2025-05-07T20:26:29.7095144Z  - conda-forge
2025-05-07T20:26:29.7095407Z Platform: linux-64
2025-05-07T20:26:33.0904402Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.8019505Z Solving environment: done
2025-05-07T20:26:34.0198397Z 
2025-05-07T20:26:34.0198706Z ## Package Plan ##
2025-05-07T20:26:34.0198877Z 
2025-05-07T20:26:34.0199091Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:34.0199410Z 
2025-05-07T20:26:34.0199505Z   added / updated specs:
2025-05-07T20:26:34.0199759Z     - numpy
2025-05-07T20:26:34.0199877Z 
2025-05-07T20:26:34.0200040Z The following packages will be downloaded:
2025-05-07T20:26:34.0200257Z 
2025-05-07T20:26:34.0200372Z     package                    |            build
2025-05-07T20:26:34.0200712Z     ---------------------------|-----------------
2025-05-07T20:26:34.0201114Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:26:34.0201576Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:26:34.0202047Z     libgfortran-15.1.0         |      h69a702a_2              34 KB  conda-forge
2025-05-07T20:26:34.0202513Z     libgfortran5-15.1.0        |      hcea5267_2             1.5 MB  conda-forge
2025-05-07T20:26:34.0202984Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:26:34.0203471Z     libopenblas-0.3.29         |pthreads_h94d23a6_0          5.6 MB  conda-forge
2025-05-07T20:26:34.0203940Z     numpy-2.2.5                | py312h72c5963_0             8.1 MB  conda-forge
2025-05-07T20:26:34.0204342Z     ------------------------------------------------------------
2025-05-07T20:26:34.0204700Z                                            Total:        15.4 MB
2025-05-07T20:26:34.0205238Z 
2025-05-07T20:26:34.0205373Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:34.0205603Z 
2025-05-07T20:26:34.0205827Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:34.0206581Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:34.0207102Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:34.0207616Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:34.0208148Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:34.0208715Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:34.0209461Z   numpy              conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
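[NOTE] The download frames below come from conda's interactive progress renderer. A sketch of the same install with that output suppressed (assuming the exact command shown above; -q is conda's standard quiet flag):

    # Same install as above, but with progress bars suppressed, which avoids
    # the terminal-control frames that clutter CI logs like this one.
    conda install -n build_binary -c conda-forge --override-channels -y -q numpy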
2025-05-07T20:26:34.0209910Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:34.1224226Z libblas-3.9.0        | 16 KB   | ########## | 100%
2025-05-07T20:26:34.2466528Z libcblas-3.9.0       | 16 KB   | ########## | 100%
2025-05-07T20:26:34.3253985Z liblapack-3.9.0      | 16 KB   | ########## | 100%
2025-05-07T20:26:34.3712845Z libgfortran-15.1.0   | 34 KB   | ########## | 100%
2025-05-07T20:26:34.3762605Z libgfortran5-15.1.0  | 1.5 MB  | ########## | 100%
2025-05-07T20:26:34.6322673Z libopenblas-0.3.29   | 5.6 MB  | ########## | 100%
2025-05-07T20:26:35.0017412Z numpy-2.2.5          | 8.1 MB  | ########## | 100%
2025-05-07T20:26:35.0029022Z done
2025-05-07T20:26:35.1033139Z Preparing transaction: done
2025-05-07T20:26:35.2036120Z Verifying transaction: done
2025-05-07T20:26:35.3044859Z Executing transaction: done
2025-05-07T20:26:35.4845589Z ################################################################################
2025-05-07T20:26:35.4845994Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:35.4846287Z #
2025-05-07T20:26:35.4862941Z # [2025-05-07T20:26:35.485Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:26:35.4863441Z ################################################################################
2025-05-07T20:26:35.4864136Z 
2025-05-07T20:26:35.4878514Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.5801291Z [CHECK] Network does not appear to be blocked.
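[NOTE] The [EXEC] [ATTEMPT 0/3] prefix comes from a retry wrapper defined in .github/scripts/setup_env.bash; its implementation is not shown in this log. A minimal bash sketch of the same idea (the function name and retry policy are assumptions, not the real helper):

    # Hypothetical stand-in for the wrapper behind the [EXEC] [ATTEMPT n/3] lines.
    exec_with_retries() {
      local max=3 attempt
      for attempt in $(seq 0 $((max - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0    # stop on first success
        sleep 2             # back off before the next attempt
      done
      return 1              # all attempts failed
    }
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null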
2025-05-07T20:26:35.5801772Z ################################################################################
2025-05-07T20:26:35.5802186Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:35.5802518Z #
2025-05-07T20:26:35.5819584Z # [2025-05-07T20:26:35.581Z] + __prepare_pip_arguments torch nightly cuda/12.8.0
2025-05-07T20:26:35.5820131Z ################################################################################
2025-05-07T20:26:35.5820351Z 
2025-05-07T20:26:35.5841049Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:35.5869498Z [INSTALL] Extracted package variant: cu128
2025-05-07T20:26:35.5886454Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:35.5887088Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:26:35.5895419Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:35.5904384Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ...
2025-05-07T20:26:35.5926744Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:28:12.3390805Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:28:12.3391434Z Collecting torch
2025-05-07T20:28:12.3392508Z   Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:28:12.3393670Z Collecting filelock (from torch)
2025-05-07T20:28:12.3394407Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:28:12.3395920Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2)
2025-05-07T20:28:12.3397565Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1)
2025-05-07T20:28:12.3398591Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:28:12.3399335Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:28:12.3400558Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 51.0 MB/s eta 0:00:00
2025-05-07T20:28:12.3401072Z Collecting networkx (from torch)
2025-05-07T20:28:12.3401794Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:28:12.3402738Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.5 MB/s eta 0:00:00
2025-05-07T20:28:12.3403240Z Collecting jinja2 (from torch)
2025-05-07T20:28:12.3403995Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:28:12.3404752Z Collecting fsspec (from torch)
2025-05-07T20:28:12.3405473Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:28:12.3406624Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch)
2025-05-07T20:28:12.3408495Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3409733Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch)
2025-05-07T20:28:12.3411004Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3412339Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch)
2025-05-07T20:28:12.3413554Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3414964Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch)
2025-05-07T20:28:12.3416000Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB)
2025-05-07T20:28:12.3417097Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch)
2025-05-07T20:28:12.3418185Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3419261Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch)
2025-05-07T20:28:12.3420464Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:12.3421663Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch)
2025-05-07T20:28:12.3422743Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:12.3423850Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch)
2025-05-07T20:28:12.3424966Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3426164Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch)
2025-05-07T20:28:12.3427398Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3428638Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:28:12.3429737Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
2025-05-07T20:28:12.3430832Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:28:12.3432007Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:28:12.3433214Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch)
2025-05-07T20:28:12.3434410Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3435648Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch)
2025-05-07T20:28:12.3436892Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3438123Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch)
2025-05-07T20:28:12.3439333Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:12.3440584Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:28:12.3441848Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3443100Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:28:12.3444150Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:28:12.3445245Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 4.9 MB/s eta 0:00:00
2025-05-07T20:28:12.3445790Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:28:12.3446856Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
2025-05-07T20:28:12.3448497Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl (1047.0 MB)
2025-05-07T20:28:12.3449722Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 21.0 MB/s eta 0:00:00
2025-05-07T20:28:12.3450938Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB)
2025-05-07T20:28:12.3452248Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 53.0 MB/s eta 0:00:00
2025-05-07T20:28:12.3453460Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB)
2025-05-07T20:28:12.3454774Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 169.6 MB/s eta 0:00:00
2025-05-07T20:28:12.3455985Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB)
2025-05-07T20:28:12.3457303Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 144.8 MB/s eta 0:00:00
2025-05-07T20:28:12.3458531Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB)
2025-05-07T20:28:12.3459864Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 93.5 MB/s eta 0:00:00
2025-05-07T20:28:12.3460890Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB)
2025-05-07T20:28:12.3462080Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 44.4 MB/s eta 0:00:00
2025-05-07T20:28:12.3463228Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB)
2025-05-07T20:28:12.3464485Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 105.8 MB/s eta 0:00:00
2025-05-07T20:28:12.3465675Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB)
2025-05-07T20:28:12.3466933Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 100.6 MB/s eta 0:00:00
2025-05-07T20:28:12.3467975Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB)
2025-05-07T20:28:12.3469152Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 146.4 MB/s eta 0:00:00
2025-05-07T20:28:12.3470219Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB)
2025-05-07T20:28:12.3471670Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 127.9 MB/s eta 0:00:00
2025-05-07T20:28:12.3474293Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB)
2025-05-07T20:28:12.3474293Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 121.3 MB/s eta 0:00:00
2025-05-07T20:28:12.3475334Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:28:12.3476499Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.4 MB/s eta 0:00:00
2025-05-07T20:28:12.3477804Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:28:12.3479064Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 130.6 MB/s eta 0:00:00
2025-05-07T20:28:12.3480219Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB)
2025-05-07T20:28:12.3481483Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 153.7 MB/s eta 0:00:00
2025-05-07T20:28:12.3482569Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
2025-05-07T20:28:12.3484270Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB)
2025-05-07T20:28:12.3485588Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 130.8 MB/s eta 0:00:00
2025-05-07T20:28:12.3488170Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:28:12.3490597Z 
2025-05-07T20:28:12.3493649Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128
2025-05-07T20:28:12.3496755Z 
2025-05-07T20:28:14.5530761Z torch 2.8.0.dev20250507+cu128
2025-05-07T20:28:14.5533247Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128)
2025-05-07T20:28:18.0687582Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:21.6346036Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128
2025-05-07T20:28:21.6346626Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:25.0705033Z True
2025-05-07T20:28:25.0705286Z True
2025-05-07T20:28:25.0705392Z 
2025-05-07T20:28:25.1333039Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
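[NOTE] The variant, sub-package, and ABI checks above can be reproduced by hand with public torch APIs; a minimal sketch (the workflow's own checks live in setup_env.bash):

    # Re-run the post-install checks against the build_binary environment:
    # version suffix (+cu128 variant), CUDA availability on this GPU runner,
    # the _GLIBCXX_USE_CXX11_ABI probe, and torch.distributed importability.
    conda run -n build_binary python -c "import torch, torch.distributed; print(torch.__version__); print(torch.cuda.is_available()); print(torch.compiled_with_cxx11_abi())"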
2025-05-07T20:28:25.1380211Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:25.1380813Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:25.1393355Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:25.1393756Z env:
2025-05-07T20:28:25.1393989Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:25.1394295Z   BUILD_ENV: build_binary
2025-05-07T20:28:25.1394730Z   BUILD_TARGET: genai
2025-05-07T20:28:25.1394962Z   BUILD_VARIANT: cuda
2025-05-07T20:28:25.1395201Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:28:25.1395455Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:25.1395760Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:25.1396103Z ##[endgroup]
2025-05-07T20:28:25.4785460Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:25.4787480Z ################################################################################
2025-05-07T20:28:25.4787947Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:25.4788304Z #
2025-05-07T20:28:25.4803434Z # [2025-05-07T20:28:25.480Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:25.4803953Z ################################################################################
2025-05-07T20:28:25.4804174Z 
2025-05-07T20:28:25.4819272Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:25.5740553Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:25.5751113Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:25.5751750Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:25.5752163Z 
2025-05-07T20:28:25.6636319Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:25.6659588Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:31.6117354Z Collecting environment information...
2025-05-07T20:28:31.6117895Z PyTorch version: 2.8.0.dev20250507+cu128
2025-05-07T20:28:31.6118290Z Is debug build: False
2025-05-07T20:28:31.6118540Z CUDA used to build PyTorch: 12.8
2025-05-07T20:28:31.6118816Z ROCM used to build PyTorch: N/A
2025-05-07T20:28:31.6118990Z 
2025-05-07T20:28:31.6119101Z OS: Amazon Linux 2023.6.20250317 (x86_64)
2025-05-07T20:28:31.6119497Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:28:31.6119965Z Clang version: Could not collect
2025-05-07T20:28:31.6120362Z CMake version: Could not collect
2025-05-07T20:28:31.6120721Z Libc version: glibc-2.34
2025-05-07T20:28:31.6120941Z 
2025-05-07T20:28:31.6121296Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime)
2025-05-07T20:28:31.6121999Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34
2025-05-07T20:28:31.6122497Z Is CUDA available: True
2025-05-07T20:28:31.6122753Z CUDA runtime version: 12.8.61
2025-05-07T20:28:31.6123020Z CUDA_MODULE_LOADING set to: LAZY
2025-05-07T20:28:31.6123330Z GPU models and configuration: GPU 0: NVIDIA A10G
2025-05-07T20:28:31.6123653Z Nvidia driver version: 570.133.07
2025-05-07T20:28:31.6123929Z cuDNN version: Could not collect
2025-05-07T20:28:31.6124197Z HIP runtime version: N/A
2025-05-07T20:28:31.6124442Z MIOpen runtime version: N/A
2025-05-07T20:28:31.6124738Z Is XNNPACK available: True
2025-05-07T20:28:31.6124963Z 
2025-05-07T20:28:31.6125062Z CPU:
2025-05-07T20:28:31.6125281Z Architecture:            x86_64
2025-05-07T20:28:31.6125615Z CPU op-mode(s):          32-bit, 64-bit
2025-05-07T20:28:31.6126009Z Address sizes:           48 bits physical, 48 bits virtual
2025-05-07T20:28:31.6126398Z Byte Order:              Little Endian
2025-05-07T20:28:31.6126712Z CPU(s):                  16
2025-05-07T20:28:31.6127010Z On-line CPU(s) list:     0-15
2025-05-07T20:28:31.6127777Z Vendor ID:               AuthenticAMD
2025-05-07T20:28:31.6128124Z Model name:              AMD EPYC 7R32
2025-05-07T20:28:31.6128447Z CPU family:              23
2025-05-07T20:28:31.6128735Z Model:                   49
2025-05-07T20:28:31.6129023Z Thread(s) per core:      2
2025-05-07T20:28:31.6129310Z Core(s) per socket:      8
2025-05-07T20:28:31.6129594Z Socket(s):               1
2025-05-07T20:28:31.6130022Z Stepping:                0
2025-05-07T20:28:31.6130321Z BogoMIPS:                5600.00
2025-05-07T20:28:31.6132627Z Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:28:31.6134775Z Hypervisor vendor:       KVM
2025-05-07T20:28:31.6135090Z Virtualization type:     full
2025-05-07T20:28:31.6135430Z L1d cache:               256 KiB (8 instances)
2025-05-07T20:28:31.6135795Z L1i cache:               256 KiB (8 instances)
2025-05-07T20:28:31.6136154Z L2 cache:                4 MiB (8 instances)
2025-05-07T20:28:31.6136505Z L3 cache:                32 MiB (2 instances)
2025-05-07T20:28:31.6136829Z NUMA node(s):            1
2025-05-07T20:28:31.6137117Z NUMA node0 CPU(s):       0-15
2025-05-07T20:28:31.6137456Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:28:31.6137840Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:28:31.6138199Z Vulnerability L1tf:                   Not affected
Mds: Not affected 2025-05-07T20:28:31.6138912Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:31.6139266Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:31.6139673Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:31.6140231Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:31.6140824Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:31.6141374Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:31.6142064Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:31.6142939Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:31.6143630Z Vulnerability Srbds: Not affected 2025-05-07T20:28:31.6143992Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:31.6144229Z 2025-05-07T20:28:31.6144332Z Versions of relevant libraries: 2025-05-07T20:28:31.6144600Z [pip3] numpy==2.2.5 2025-05-07T20:28:31.6144846Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:31.6145149Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:31.6145468Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:31.6145791Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:31.6146101Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:31.6146395Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:31.6146692Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:31.6146988Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:31.6147299Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:31.6147735Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:31.6148033Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:31.6148323Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:31.6148630Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:31.6148921Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:31.6149221Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:31.6149600Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:31.6150177Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:31.6150689Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:31.6151221Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:31.6151765Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:31.6152306Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:31.6152792Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6153267Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:31.6153754Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:31.6154252Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:31.6154736Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6155283Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:31.6155786Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6156244Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6156725Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:31.6157215Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:31.6157680Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:31.6158153Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:31.6158625Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6159091Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:31.6159563Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6160036Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:31.6160518Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:31.6161006Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:31.6161551Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6162044Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:31.6162535Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6163021Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:31.6163488Z [conda] numpy 2.2.5 py312h72c5963_0 conda-forge 2025-05-07T20:28:31.6163956Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:31.6164462Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:31.6164964Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:31.6165475Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:31.6165971Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:31.6166546Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:31.6167038Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:31.6167533Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:31.6168031Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:31.6168530Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:31.6169112Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:31.6169604Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:31.6170087Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:31.6170575Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:31.6171046Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:31.6171322Z 2025-05-07T20:28:31.6850998Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:31.6851687Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:31.6863737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:31.6864102Z env: 2025-05-07T20:28:31.6864327Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:31.6864628Z BUILD_ENV: build_binary 2025-05-07T20:28:31.6864900Z BUILD_TARGET: genai 2025-05-07T20:28:31.6865133Z BUILD_VARIANT: cuda 2025-05-07T20:28:31.6865365Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:31.6865625Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:31.6865938Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:31.6866278Z ##[endgroup] 2025-05-07T20:28:32.0240348Z ################################################################################ 2025-05-07T20:28:32.0240801Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:32.0241308Z # 2025-05-07T20:28:32.0257200Z # [2025-05-07T20:28:32.025Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:32.0257768Z ################################################################################ 2025-05-07T20:28:32.0258040Z 2025-05-07T20:28:32.0274378Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:32.1189000Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:32.1210856Z [BUILD] Running git submodules update ... 2025-05-07T20:28:32.1232752Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:32.1595076Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:32.1596059Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:32.1596964Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:32.1597765Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:32.1598587Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:32.1599474Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:32.1600305Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:32.1632882Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:32.2187884Z [BUILD] Installing other build dependencies ... 
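[NOTE] The [EXEC] [ATTEMPT 0/3] prefixes in this log come from a retry wrapper defined in setup_env.bash; its source is not shown here. A minimal sketch of the pattern, with an illustrative function name and retry/backoff values (both assumptions, not the script's literal code):

  # Retry a command up to the attempt limit, echoing each try like the log lines above.
  exec_with_retries () {
    local max=3
    for attempt in $(seq 0 "$max"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
      "$@" && return 0
      sleep 2  # brief pause before retrying
    done
    return 1
  }

  exec_with_retries git submodule update --init --recursive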
2025-05-07T20:28:32.2210426Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:34.6571140Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:34.6757212Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:34.7777325Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:34.7825578Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:35.0010504Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:35.0040120Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:35.1207411Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:35.1228777Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:35.4577700Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:35.4601948Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:35.5164932Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:35.5168641Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:35.5968991Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:35.5996411Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:35.6429762Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:35.7071010Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:35.7094962Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:35.8373648Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:35.8400028Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:35.9513799Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:35.9554600Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:36.0087451Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:36.0758558Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:36.0784588Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:36.1761313Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:36.1787516Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:36.3039237Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:36.3062696Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:36.4108499Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:36.4136588Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:36.5194512Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:36.5226710Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:36.6390337Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:36.6454189Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:36.7583701Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:36.7645162Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:36.8220777Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:36.8763714Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:36.8791341Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:36.9301160Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:36.9806470Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:36.9829482Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:37.0347434Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:37.0991443Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:37.1018118Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:37.1627664Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:37.2270334Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:37.2816363Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:37.8131886Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 52.4 MB/s eta 0:00:00 2025-05-07T20:28:37.8159527Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:37.8708229Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:37.9334823Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:37.9908929Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:38.0570349Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:38.1145430Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:28:38.1827508Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 7.7 MB/s eta 0:00:00 2025-05-07T20:28:38.1893344Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:38.2395234Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:38.2919076Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:38.3405514Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:38.3950502Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:38.4503835Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:38.5021232Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:38.5555920Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:38.6003550Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:38.6496577Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:38.8152659Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:41.1710007Z 2025-05-07T20:28:41.1755383Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:41.3482656Z ################################################################################ 2025-05-07T20:28:41.3483005Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:41.3483266Z # 2025-05-07T20:28:41.3500203Z # [2025-05-07T20:28:41.349Z] + install_triton_pip build_binary 2025-05-07T20:28:41.3500599Z ################################################################################ 2025-05-07T20:28:41.3500816Z 2025-05-07T20:28:41.3501052Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:41.3501490Z ################################################################################ 2025-05-07T20:28:41.3501854Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:41.3502180Z # 2025-05-07T20:28:41.3517557Z # [2025-05-07T20:28:41.351Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:41.3518093Z ################################################################################ 2025-05-07T20:28:41.3518315Z 2025-05-07T20:28:41.3533029Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:41.4403949Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:41.4404405Z ################################################################################ 2025-05-07T20:28:41.4405008Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:41.4405301Z # 2025-05-07T20:28:41.4421652Z # [2025-05-07T20:28:41.441Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:41.4422146Z ################################################################################ 2025-05-07T20:28:41.4422377Z 2025-05-07T20:28:41.4470747Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:41.4487424Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:41.4488168Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:41.4497258Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:41.4507556Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:41.4529430Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:49.3020067Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:49.3021324Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:49.3022021Z 2025-05-07T20:28:49.3022233Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:49.3022659Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:49.3023481Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:49.3024740Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:49.3025856Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 53.2 MB/s eta 0:00:00 2025-05-07T20:28:49.3026280Z Installing collected packages: pytorch-triton 2025-05-07T20:28:49.3026656Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:49.3027060Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:49.3027496Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:49.3027930Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:49.3028384Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:49.3028657Z 2025-05-07T20:28:51.5054245Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:51.5058047Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:53.6583983Z ################################################################################ 2025-05-07T20:28:53.6584442Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:53.6584869Z ################################################################################ 2025-05-07T20:28:53.6585085Z 2025-05-07T20:28:55.6993483Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:57.8735904Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:57.8740092Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:57.8772454Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:57.8772961Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:57.8784594Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:57.8784944Z env: 2025-05-07T20:28:57.8785170Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:57.8785472Z BUILD_ENV: build_binary 2025-05-07T20:28:57.8785724Z BUILD_TARGET: genai 2025-05-07T20:28:57.8785958Z BUILD_VARIANT: cuda 2025-05-07T20:28:57.8786196Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:57.8786673Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:57.8786985Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:57.8787333Z ##[endgroup] 2025-05-07T20:28:58.2159091Z ################################################################################ 2025-05-07T20:28:58.2159526Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:58.2159794Z # 2025-05-07T20:28:58.2174341Z # [2025-05-07T20:28:58.217Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2175025Z ################################################################################ 2025-05-07T20:28:58.2175251Z 2025-05-07T20:28:58.2175625Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2176343Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2176694Z 2025-05-07T20:28:58.2336664Z f50ab0f907b8f67d4668daa75040e0b225eb54da fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2338812Z 2025-05-07T20:28:58.2339708Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2525415Z 2025-05-07T20:28:58.2526033Z 84bef3f6640ba9766f361e68bdc3f73d7442e779219a4c2a79fb3e077b76dfbc fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2528414Z 2025-05-07T20:28:58.2528791Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2529139Z 2025-05-07T20:28:58.2859215Z 9355e644e981da7f530670dbccbd5e53 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2861538Z 2025-05-07T20:28:58.2870713Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:58.2891893Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:01.0879849Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:01.0880865Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:01.0881760Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:01.0882206Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:01.0882480Z 2025-05-07T20:29:08.0845123Z ################################################################################ 2025-05-07T20:29:08.0846071Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:08.0846768Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128
2025-05-07T20:29:08.0847549Z [CHECK] CUDA version reported by PyTorch is: 12.8
2025-05-07T20:29:08.0848125Z [CHECK]
2025-05-07T20:29:08.0848704Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:29:08.0849667Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:29:08.0850382Z ################################################################################
2025-05-07T20:29:08.0850769Z
2025-05-07T20:29:08.0850980Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:29:12.1107005Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:29:16.1376674Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:20.1501262Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:20.1504678Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:32.1537439Z ################################################################################
2025-05-07T20:29:32.1537879Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:32.1538225Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:32.1538577Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:32.1539306Z ################################################################################
2025-05-07T20:29:32.1539533Z
2025-05-07T20:29:40.1384034Z ################################################################################
2025-05-07T20:29:40.1384599Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:40.1386105Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:40.1387746Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:40.1388285Z ################################################################################
2025-05-07T20:29:40.1388524Z
2025-05-07T20:29:40.1388684Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:44.1598547Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:48.1504303Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:52.2467847Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:56.2494153Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:56.2499316Z [INSTALL] Check for operator registrations ...
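[NOTE] The operator-registration check that follows imports fbgemm_gpu (which loads the compiled operator libraries) and then resolves each name on torch.ops. A hand-rolled equivalent of the same check (assembled for illustration; not the script's literal code):

  conda run -n build_binary python -c "
  import torch
  import fbgemm_gpu  # the import loads the FBGEMM operator libraries
  # attribute lookup succeeds only if the operator was registered with PyTorch
  print(torch.ops.fbgemm.nccl_init)
  "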
2025-05-07T20:30:00.1497425Z fbgemm.nccl_init 2025-05-07T20:30:00.1497644Z 2025-05-07T20:30:00.2116337Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:04.1174090Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:04.1174317Z 2025-05-07T20:30:04.1802120Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:08.0741179Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:08.0741438Z 2025-05-07T20:30:08.1367185Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:08.1367798Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:08.1404092Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:08.1404574Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:08.1418125Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:08.1418491Z env: 2025-05-07T20:30:08.1428351Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:08.1428679Z BUILD_ENV: build_binary 2025-05-07T20:30:08.1428927Z BUILD_TARGET: genai 2025-05-07T20:30:08.1429160Z BUILD_VARIANT: cuda 2025-05-07T20:30:08.1429424Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:08.1429711Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:08.1430014Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:08.1430371Z ##[endgroup] 2025-05-07T20:30:08.4796341Z ################################################################################ 2025-05-07T20:30:08.4796688Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:08.4796945Z # 2025-05-07T20:30:08.4814086Z # [2025-05-07T20:30:08.480Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:08.4814506Z ################################################################################ 2025-05-07T20:30:08.4814722Z 2025-05-07T20:30:16.4134577Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:16.4135186Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:16.4135590Z [TEST] Determined the test directories: 2025-05-07T20:30:16.4135920Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:16.4136236Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:16.4136552Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:16.4136747Z 2025-05-07T20:30:16.4143212Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:16.4150213Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:16.4150666Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:16.4150952Z 2025-05-07T20:30:16.8380549Z 2025-05-07T20:30:16.8380882Z [TEST] Installing PyTest ... 
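[NOTE] The (genai : cuda) pair above is read back from the installed package itself: fbgemm_gpu exposes __target__ and __variant__, both of which were found during the symbol checks earlier. A quick manual confirmation (a sketch, not taken from this log):

  conda run -n build_binary python -c "import fbgemm_gpu; print(fbgemm_gpu.__target__, fbgemm_gpu.__variant__, fbgemm_gpu.__version__)"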
2025-05-07T20:30:16.8404797Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:17.9327381Z Channels:
2025-05-07T20:30:17.9327691Z  - conda-forge
2025-05-07T20:30:17.9327986Z Platform: linux-64
2025-05-07T20:30:21.1923987Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:22.3319514Z Solving environment: done
2025-05-07T20:30:22.5602080Z ## Package Plan ##
2025-05-07T20:30:22.5602884Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:22.5603714Z   added / updated specs:
2025-05-07T20:30:22.5604003Z     - expecttest
2025-05-07T20:30:22.5604257Z     - pytest
2025-05-07T20:30:22.5604528Z The following packages will be downloaded:
2025-05-07T20:30:22.5604885Z     package                    |            build
2025-05-07T20:30:22.5605225Z     ---------------------------|-----------------
2025-05-07T20:30:22.5605624Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:22.5606103Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:22.5606818Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:22.5607278Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:22.5607735Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:22.5608172Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:22.5608616Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:22.5609332Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:22.5609738Z     ------------------------------------------------------------
2025-05-07T20:30:22.5610084Z                                            Total:         428 KB
2025-05-07T20:30:22.5610434Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:22.5610869Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:22.5611383Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:22.5611919Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:22.5612470Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:22.5612951Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:22.5613403Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:22.5613845Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:22.5614280Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:22.5614701Z Downloading and Extracting Packages: done (per-package progress bars elided; all eight packages downloaded to 100%)
2025-05-07T20:30:23.1520660Z Preparing transaction: done
2025-05-07T20:30:23.2525564Z Verifying transaction: done
2025-05-07T20:30:25.1556676Z Executing transaction: done
2025-05-07T20:30:25.2810830Z [TEST] Checking imports ...
2025-05-07T20:30:29.2620881Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:29.2633196Z [TEST] Setting feature flags ...
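[NOTE] conda env config vars stores variables in the environment itself, so the feature flag set below takes effect for every subsequent conda run/activation rather than just this shell. To verify what is persisted (standard conda CLI usage, shown as an assumption rather than copied from this run):

  conda env config vars list -n build_binary
  # expected to include FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 once the step below completes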
2025-05-07T20:30:29.2633777Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:29.2634242Z 2025-05-07T20:30:29.6836183Z 2025-05-07T20:30:29.6837140Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:29.6839541Z ################################################################################ 2025-05-07T20:30:29.6840442Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:29.6841090Z # 2025-05-07T20:30:29.6860767Z # [2025-05-07T20:30:29.685Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:29.6861371Z ################################################################################ 2025-05-07T20:30:29.6861677Z 2025-05-07T20:30:29.6868274Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:29.6898249Z ./attention/gqa_test.py 2025-05-07T20:30:29.6898620Z ./coalesce/coalesce_test.py 2025-05-07T20:30:29.6898999Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:29.6899381Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:29.6899787Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:29.6900051Z ./moe/activation_test.py 2025-05-07T20:30:29.6900308Z ./moe/gather_scatter_test.py 2025-05-07T20:30:29.6900559Z ./moe/layers_test.py 2025-05-07T20:30:29.6900796Z ./moe/shuffling_test.py 2025-05-07T20:30:29.6901056Z ./quantize/quantize_test.py 2025-05-07T20:30:29.6901221Z 2025-05-07T20:30:29.6901338Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:29.6901556Z 2025-05-07T20:30:29.6920190Z ################################################################################ 2025-05-07T20:30:29.6935807Z # [2025-05-07T20:30:29.693Z] Run Python Test Suite: 2025-05-07T20:30:29.6936281Z # ./attention/gqa_test.py 2025-05-07T20:30:29.6936662Z ################################################################################ 2025-05-07T20:30:29.6960681Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:29.6961475Z 2025-05-07T20:30:32.2255868Z ============================= test session starts ============================== 2025-05-07T20:30:32.2256662Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:32.2257284Z cachedir: .pytest_cache 2025-05-07T20:30:32.2257881Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:32.2258957Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:32.2259384Z plugins: hypothesis-6.131.14 2025-05-07T20:30:33.9172243Z collecting ... 
collected 2 items
2025-05-07T20:31:11.8449991Z attention/gqa_test.py::Int4GQATest::test_gqa Trying 40 Hypothesis examples (the self=<Int4GQATest ...> repr in each call is omitted below):
  test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
  test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
  test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
  test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
  test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
  test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
  test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
  test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
  test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
  test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
  test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
  test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
  test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
  test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
  test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
  test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
  test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
  test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
  test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
  test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
  test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
  test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
  test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
  test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
  test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
  test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
  test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
  test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:31:11.8540985Z PASSED
2025-05-07T20:31:11.8645955Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
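[NOTE] The derandomized example sequence above is governed by the Hypothesis 'ci' profile reported in the session header (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). One way to inspect that profile outside the run (a sketch; not part of this log):

  conda run -n build_binary python -c "from hypothesis import settings; print(settings.get_profile('ci'))"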
2025-05-07T20:31:11.8646295Z 2025-05-07T20:31:11.8646450Z =========================== short test summary info ============================ 2025-05-07T20:31:11.8647190Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:11.8647917Z ======================== 1 passed, 1 skipped in 40.14s ========================= 2025-05-07T20:31:12.5174084Z 2025-05-07T20:31:12.5175049Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:12.5194203Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds 2025-05-07T20:31:12.5194493Z 2025-05-07T20:31:12.5194497Z 2025-05-07T20:31:12.5194501Z 2025-05-07T20:31:12.5194505Z 2025-05-07T20:31:12.5215172Z ################################################################################ 2025-05-07T20:31:12.5230679Z # [2025-05-07T20:31:12.522Z] Run Python Test Suite: 2025-05-07T20:31:12.5231029Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:12.5231340Z ################################################################################ 2025-05-07T20:31:12.5257470Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:12.5258137Z 2025-05-07T20:31:14.6665674Z ============================= test session starts ============================== 2025-05-07T20:31:14.6666393Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:14.6666926Z cachedir: .pytest_cache 2025-05-07T20:31:14.6667529Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:14.6668288Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:14.6668709Z plugins: hypothesis-6.131.14 2025-05-07T20:31:16.4009791Z collecting ... 
collected 1 item 2025-05-07T20:31:16.4010007Z 2025-05-07T20:31:17.1565071Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:17.1565425Z 2025-05-07T20:31:17.1565570Z ============================== 1 passed in 2.61s =============================== 2025-05-07T20:31:17.7870992Z 2025-05-07T20:31:17.7871683Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:17.7892322Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:17.7892648Z 2025-05-07T20:31:17.7892654Z 2025-05-07T20:31:17.7892658Z 2025-05-07T20:31:17.7892662Z 2025-05-07T20:31:17.7913469Z ################################################################################ 2025-05-07T20:31:17.7928387Z # [2025-05-07T20:31:17.792Z] Run Python Test Suite: 2025-05-07T20:31:17.7928747Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:17.7929039Z ################################################################################ 2025-05-07T20:31:17.7954173Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:17.7954826Z 2025-05-07T20:31:19.9449590Z ============================= test session starts ============================== 2025-05-07T20:31:19.9450266Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:19.9450810Z cachedir: .pytest_cache 2025-05-07T20:31:19.9451422Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:19.9452471Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:19.9453087Z plugins: hypothesis-6.131.14 2025-05-07T20:31:21.6590895Z collecting ... 
collected 5 items 2025-05-07T20:31:21.6591210Z 2025-05-07T20:31:21.6603913Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:21.6614470Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:21.6622521Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:21.6630512Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:21.6649538Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:21.6649904Z 2025-05-07T20:31:21.6650453Z =========================== short test summary info ============================ 2025-05-07T20:31:21.6651158Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6652314Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6653704Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6654946Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6655905Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6656571Z ============================== 5 skipped in 1.84s ============================== 2025-05-07T20:31:22.2418226Z 2025-05-07T20:31:22.2419004Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:22.2437699Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:22.2438010Z 2025-05-07T20:31:22.2438014Z 2025-05-07T20:31:22.2438017Z 2025-05-07T20:31:22.2438021Z 2025-05-07T20:31:22.2459159Z ################################################################################ 2025-05-07T20:31:22.2476866Z # [2025-05-07T20:31:22.247Z] Run Python Test Suite: 2025-05-07T20:31:22.2477227Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:22.2477546Z ################################################################################ 2025-05-07T20:31:22.2502358Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:22.2503702Z 2025-05-07T20:31:24.4121596Z ============================= test session starts ============================== 2025-05-07T20:31:24.4122328Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:24.4122872Z cachedir: .pytest_cache 2025-05-07T20:31:24.4123477Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:24.4124240Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:24.4124661Z plugins: hypothesis-6.131.14 2025-05-07T20:31:26.2247005Z collecting ... 
collected 2 items 2025-05-07T20:31:26.2247219Z 2025-05-07T20:31:26.2258708Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:26.2275681Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:26.2276125Z 2025-05-07T20:31:26.2276289Z =========================== short test summary info ============================ 2025-05-07T20:31:26.2276933Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:26.2277783Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:26.2278398Z ============================== 2 skipped in 1.94s ============================== 2025-05-07T20:31:26.8165989Z 2025-05-07T20:31:26.8166479Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:26.8187534Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:26.8187889Z 2025-05-07T20:31:26.8187894Z 2025-05-07T20:31:26.8187925Z 2025-05-07T20:31:26.8187929Z 2025-05-07T20:31:26.8208772Z ################################################################################ 2025-05-07T20:31:26.8227293Z # [2025-05-07T20:31:26.822Z] Run Python Test Suite: 2025-05-07T20:31:26.8227978Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:26.8228275Z ################################################################################ 2025-05-07T20:31:26.8252354Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:26.8253298Z 2025-05-07T20:31:28.9709482Z ============================= test session starts ============================== 2025-05-07T20:31:28.9710297Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:28.9710834Z cachedir: .pytest_cache 2025-05-07T20:31:28.9711430Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:28.9712198Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:28.9712612Z plugins: hypothesis-6.131.14 2025-05-07T20:31:30.6584497Z collecting ... collected 4 items 2025-05-07T20:31:30.6584865Z 2025-05-07T20:31:33.4184163Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
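The skips above all follow one capability-gating pattern: the multi-GPU CAR tests require at least two devices, and the gather/scatter tests require a Hopper-class GPU, which the A10G on this linux.g5.4xlarge runner (SM 8.6) is not. A hedged sketch of such guards; the predicate names are illustrative, not FBGEMM's actual helpers.

# Sketch of the capability-based skip guards suggested by the log messages.
# Predicates are assumptions; FBGEMM's actual decorators may differ.
import unittest

import torch

def has_hopper_gpu() -> bool:
    # Hopper is SM 9.0; the A10G on linux.g5.4xlarge reports SM 8.6.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

def gpu_count() -> int:
    return torch.cuda.device_count() if torch.cuda.is_available() else 0

class CapabilityGatedTests(unittest.TestCase):
    @unittest.skipIf(not has_hopper_gpu(),
                     "Skip when no Hopper GPU is available. This test is only for Hopper GPU.")
    def test_gather_along_first_dim(self) -> None:
        ...  # body elided in this sketch

    @unittest.skipIf(gpu_count() < 2,
                     "these tests require at least two GPUs")
    def test_allreduce(self) -> None:
        ...  # body elided in this sketch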
2025-05-07T20:31:33.4268962Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:33.4364893Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:33.4454329Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:33.4454686Z 2025-05-07T20:31:33.4454845Z =========================== short test summary info ============================ 2025-05-07T20:31:33.4455563Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:33.4456776Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:31:33.4457407Z ============================== 4 skipped in 4.60s ============================== 2025-05-07T20:31:35.3945071Z 2025-05-07T20:31:35.3945862Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:35.3966029Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:31:35.3966338Z 2025-05-07T20:31:35.3966342Z 2025-05-07T20:31:35.3966346Z 2025-05-07T20:31:35.3966351Z 2025-05-07T20:31:35.3986919Z ################################################################################ 2025-05-07T20:31:35.4001786Z # [2025-05-07T20:31:35.399Z] Run Python Test Suite: 2025-05-07T20:31:35.4002127Z # ./moe/activation_test.py 2025-05-07T20:31:35.4002405Z ################################################################################ 2025-05-07T20:31:35.4028817Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:35.4029686Z 2025-05-07T20:31:37.5505793Z ============================= test session starts ============================== 2025-05-07T20:31:37.5506784Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:37.5507328Z cachedir: .pytest_cache 2025-05-07T20:31:37.5507926Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:37.5508679Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:37.5509090Z plugins: hypothesis-6.131.14 2025-05-07T20:31:39.2004460Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:39.3077034Z collecting ... 
collected 2 items 2025-05-07T20:31:39.3077250Z 2025-05-07T20:31:44.6388003Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:44.6388844Z self=, 2025-05-07T20:31:44.6389675Z T=1, 2025-05-07T20:31:44.6389913Z D=5120, 2025-05-07T20:31:44.6390178Z contiguous=True, 2025-05-07T20:31:44.6390488Z compiled=True, 2025-05-07T20:31:44.6390795Z ) 2025-05-07T20:31:44.6391070Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6391596Z self=, 2025-05-07T20:31:44.6392005Z T=4096, 2025-05-07T20:31:44.6392197Z D=5120, 2025-05-07T20:31:44.6392438Z contiguous=True, 2025-05-07T20:31:44.6392756Z compiled=True, 2025-05-07T20:31:44.6393020Z ) 2025-05-07T20:31:44.6393254Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6393629Z self=, 2025-05-07T20:31:44.6394018Z T=4096, 2025-05-07T20:31:44.6394226Z D=7168, 2025-05-07T20:31:44.6394428Z contiguous=False, 2025-05-07T20:31:44.6394654Z compiled=False, 2025-05-07T20:31:44.6394865Z ) 2025-05-07T20:31:44.6395065Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6395496Z self=, 2025-05-07T20:31:44.6395885Z T=4096, 2025-05-07T20:31:44.6396079Z D=5120, 2025-05-07T20:31:44.6396271Z contiguous=False, 2025-05-07T20:31:44.6396501Z compiled=True, 2025-05-07T20:31:44.6396704Z ) 2025-05-07T20:31:44.6396897Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6397277Z self=, 2025-05-07T20:31:44.6397665Z T=1, 2025-05-07T20:31:44.6397843Z D=7168, 2025-05-07T20:31:44.6398040Z contiguous=True, 2025-05-07T20:31:44.6398274Z compiled=True, 2025-05-07T20:31:44.6398477Z ) 2025-05-07T20:31:44.6398676Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6399059Z self=, 2025-05-07T20:31:44.6399657Z T=1, 2025-05-07T20:31:44.6399845Z D=7168, 2025-05-07T20:31:44.6400046Z contiguous=False, 2025-05-07T20:31:44.6400276Z compiled=True, 2025-05-07T20:31:44.6400483Z ) 2025-05-07T20:31:44.6400681Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6401062Z self=, 2025-05-07T20:31:44.6401443Z T=4096, 2025-05-07T20:31:44.6401634Z D=5120, 2025-05-07T20:31:44.6401836Z contiguous=False, 2025-05-07T20:31:44.6402062Z compiled=False, 2025-05-07T20:31:44.6402273Z ) 2025-05-07T20:31:44.6402474Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6402848Z self=, 2025-05-07T20:31:44.6403236Z T=1, 2025-05-07T20:31:44.6403423Z D=7168, 2025-05-07T20:31:44.6403617Z contiguous=True, 2025-05-07T20:31:44.6403845Z compiled=False, 2025-05-07T20:31:44.6404054Z ) 2025-05-07T20:31:44.6404252Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6404636Z self=, 2025-05-07T20:31:44.6405021Z T=2048, 2025-05-07T20:31:44.6405202Z D=5120, 2025-05-07T20:31:44.6405401Z contiguous=True, 2025-05-07T20:31:44.6405628Z compiled=True, 2025-05-07T20:31:44.6405829Z ) 2025-05-07T20:31:44.6406027Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6406780Z self=, 2025-05-07T20:31:44.6407168Z T=2048, 2025-05-07T20:31:44.6407350Z D=7168, 2025-05-07T20:31:44.6407545Z contiguous=True, 2025-05-07T20:31:44.6407769Z compiled=True, 2025-05-07T20:31:44.6407968Z ) 2025-05-07T20:31:44.6408167Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6408546Z self=, 2025-05-07T20:31:44.6408926Z T=2048, 2025-05-07T20:31:44.6409111Z D=7168, 2025-05-07T20:31:44.6409307Z contiguous=True, 2025-05-07T20:31:44.6409535Z compiled=False, 2025-05-07T20:31:44.6409743Z ) 2025-05-07T20:31:44.6409940Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6410311Z self=, 2025-05-07T20:31:44.6410869Z T=128, 2025-05-07T20:31:44.6411069Z D=5120, 2025-05-07T20:31:44.6411268Z contiguous=False, 2025-05-07T20:31:44.6411494Z 
compiled=True, 2025-05-07T20:31:44.6411697Z ) 2025-05-07T20:31:44.6412001Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6419703Z self=, 2025-05-07T20:31:44.6420106Z T=128, 2025-05-07T20:31:44.6420287Z D=5120, 2025-05-07T20:31:44.6420482Z contiguous=True, 2025-05-07T20:31:44.6420708Z compiled=True, 2025-05-07T20:31:44.6420917Z ) 2025-05-07T20:31:44.6421113Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6421503Z self=, 2025-05-07T20:31:44.6421898Z T=16384, 2025-05-07T20:31:44.6422105Z D=5120, 2025-05-07T20:31:44.6422304Z contiguous=False, 2025-05-07T20:31:44.6422536Z compiled=True, 2025-05-07T20:31:44.6422735Z ) 2025-05-07T20:31:44.6422935Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6423322Z self=, 2025-05-07T20:31:44.6423704Z T=16384, 2025-05-07T20:31:44.6423901Z D=5120, 2025-05-07T20:31:44.6424106Z contiguous=False, 2025-05-07T20:31:44.6424332Z compiled=False, 2025-05-07T20:31:44.6424539Z ) 2025-05-07T20:31:44.6424736Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6425112Z self=, 2025-05-07T20:31:44.6425491Z T=128, 2025-05-07T20:31:44.6425679Z D=7168, 2025-05-07T20:31:44.6425869Z contiguous=True, 2025-05-07T20:31:44.6426095Z compiled=False, 2025-05-07T20:31:44.6426300Z ) 2025-05-07T20:31:44.6426489Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6426868Z self=, 2025-05-07T20:31:44.6427423Z T=128, 2025-05-07T20:31:44.6427609Z D=7168, 2025-05-07T20:31:44.6427831Z contiguous=False, 2025-05-07T20:31:44.6428083Z compiled=False, 2025-05-07T20:31:44.6428289Z ) 2025-05-07T20:31:44.6428485Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6428863Z self=, 2025-05-07T20:31:44.6429246Z T=1, 2025-05-07T20:31:44.6429429Z D=5120, 2025-05-07T20:31:44.6429627Z contiguous=False, 2025-05-07T20:31:44.6429848Z compiled=False, 2025-05-07T20:31:44.6430057Z ) 2025-05-07T20:31:44.6430256Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6430629Z self=, 2025-05-07T20:31:44.6431012Z T=1, 2025-05-07T20:31:44.6431196Z D=7168, 2025-05-07T20:31:44.6431386Z contiguous=False, 2025-05-07T20:31:44.6431613Z compiled=False, 2025-05-07T20:31:44.6431821Z ) 2025-05-07T20:31:44.6432022Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6432392Z self=, 2025-05-07T20:31:44.6432777Z T=4096, 2025-05-07T20:31:44.6432966Z D=5120, 2025-05-07T20:31:44.6433162Z contiguous=True, 2025-05-07T20:31:44.6433388Z compiled=False, 2025-05-07T20:31:44.6433591Z ) 2025-05-07T20:31:44.6433782Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6434157Z self=, 2025-05-07T20:31:44.6434541Z T=128, 2025-05-07T20:31:44.6434724Z D=7168, 2025-05-07T20:31:44.6434924Z contiguous=True, 2025-05-07T20:31:44.6435150Z compiled=True, 2025-05-07T20:31:44.6435348Z ) 2025-05-07T20:31:44.6435543Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6435918Z self=, 2025-05-07T20:31:44.6436296Z T=1, 2025-05-07T20:31:44.6436480Z D=5120, 2025-05-07T20:31:44.6436677Z contiguous=False, 2025-05-07T20:31:44.6436903Z compiled=True, 2025-05-07T20:31:44.6437111Z ) 2025-05-07T20:31:44.6437307Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6437681Z self=, 2025-05-07T20:31:44.6438165Z T=4096, 2025-05-07T20:31:44.6438360Z D=7168, 2025-05-07T20:31:44.6438557Z contiguous=True, 2025-05-07T20:31:44.6438778Z compiled=False, 2025-05-07T20:31:44.6438985Z ) 2025-05-07T20:31:44.6439183Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6439553Z self=, 2025-05-07T20:31:44.6439938Z T=4096, 2025-05-07T20:31:44.6440128Z D=7168, 2025-05-07T20:31:44.6440312Z contiguous=False, 2025-05-07T20:31:44.6440538Z compiled=True, 2025-05-07T20:31:44.6440747Z ) 
2025-05-07T20:31:44.6440939Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6441314Z self=, 2025-05-07T20:31:44.6441707Z T=128, 2025-05-07T20:31:44.6441888Z D=5120, 2025-05-07T20:31:44.6442082Z contiguous=True, 2025-05-07T20:31:44.6442305Z compiled=False, 2025-05-07T20:31:44.6442504Z ) 2025-05-07T20:31:44.6442697Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6443074Z self=, 2025-05-07T20:31:44.6443449Z T=128, 2025-05-07T20:31:44.6443635Z D=5120, 2025-05-07T20:31:44.6443834Z contiguous=False, 2025-05-07T20:31:44.6444061Z compiled=False, 2025-05-07T20:31:44.6444257Z ) 2025-05-07T20:31:44.6444450Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6444822Z self=, 2025-05-07T20:31:44.6445195Z T=1, 2025-05-07T20:31:44.6445376Z D=5120, 2025-05-07T20:31:44.6445569Z contiguous=True, 2025-05-07T20:31:44.6445785Z compiled=False, 2025-05-07T20:31:44.6445986Z ) 2025-05-07T20:31:44.6446181Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6446642Z self=, 2025-05-07T20:31:44.6447024Z T=2048, 2025-05-07T20:31:44.6447212Z D=7168, 2025-05-07T20:31:44.6447399Z contiguous=False, 2025-05-07T20:31:44.6447621Z compiled=True, 2025-05-07T20:31:44.6447830Z ) 2025-05-07T20:31:44.6448017Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6448389Z self=, 2025-05-07T20:31:44.6448772Z T=2048, 2025-05-07T20:31:44.6448951Z D=7168, 2025-05-07T20:31:44.6449146Z contiguous=False, 2025-05-07T20:31:44.6449371Z compiled=False, 2025-05-07T20:31:44.6449567Z ) 2025-05-07T20:31:44.6449761Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6450135Z self=, 2025-05-07T20:31:44.6450516Z T=16384, 2025-05-07T20:31:44.6450705Z D=7168, 2025-05-07T20:31:44.6450904Z contiguous=False, 2025-05-07T20:31:44.6451130Z compiled=True, 2025-05-07T20:31:44.6451332Z ) 2025-05-07T20:31:44.6451531Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6451977Z self=, 2025-05-07T20:31:44.6452353Z T=16384, 2025-05-07T20:31:44.6452547Z D=7168, 2025-05-07T20:31:44.6452746Z contiguous=True, 2025-05-07T20:31:44.6452962Z compiled=True, 2025-05-07T20:31:44.6453166Z ) 2025-05-07T20:31:44.6453364Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6453731Z self=, 2025-05-07T20:31:44.6454113Z T=4096, 2025-05-07T20:31:44.6454301Z D=7168, 2025-05-07T20:31:44.6454487Z contiguous=True, 2025-05-07T20:31:44.6454706Z compiled=True, 2025-05-07T20:31:44.6454909Z ) 2025-05-07T20:31:44.6455098Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6455469Z self=, 2025-05-07T20:31:44.6455853Z T=2048, 2025-05-07T20:31:44.6456039Z D=5120, 2025-05-07T20:31:44.6456232Z contiguous=False, 2025-05-07T20:31:44.6456455Z compiled=False, 2025-05-07T20:31:44.6456659Z ) 2025-05-07T20:31:44.6456848Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6457322Z self=, 2025-05-07T20:31:44.6457723Z T=2048, 2025-05-07T20:31:44.6457931Z D=5120, 2025-05-07T20:31:44.6458125Z contiguous=True, 2025-05-07T20:31:44.6458348Z compiled=False, 2025-05-07T20:31:44.6458548Z ) 2025-05-07T20:31:44.6458746Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6459119Z self=, 2025-05-07T20:31:44.6459493Z T=128, 2025-05-07T20:31:44.6459682Z D=7168, 2025-05-07T20:31:44.6459878Z contiguous=False, 2025-05-07T20:31:44.6460095Z compiled=True, 2025-05-07T20:31:44.6460296Z ) 2025-05-07T20:31:44.6460491Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6460860Z self=, 2025-05-07T20:31:44.6461246Z T=16384, 2025-05-07T20:31:44.6461441Z D=5120, 2025-05-07T20:31:44.6461627Z contiguous=True, 2025-05-07T20:31:44.6461848Z compiled=True, 2025-05-07T20:31:44.6462052Z ) 2025-05-07T20:31:44.6462249Z Trying example: 
test_silu_mul( 2025-05-07T20:31:44.6462625Z self=, 2025-05-07T20:31:44.6463009Z T=2048, 2025-05-07T20:31:44.6463198Z D=5120, 2025-05-07T20:31:44.6463386Z contiguous=False, 2025-05-07T20:31:44.6463610Z compiled=True, 2025-05-07T20:31:44.6463812Z ) 2025-05-07T20:31:44.6464001Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6464376Z self=, 2025-05-07T20:31:44.6464756Z T=16384, 2025-05-07T20:31:44.6464942Z D=5120, 2025-05-07T20:31:44.6465141Z contiguous=True, 2025-05-07T20:31:44.6465367Z compiled=False, 2025-05-07T20:31:44.6465567Z ) 2025-05-07T20:31:44.6465764Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6466256Z self=, 2025-05-07T20:31:44.6466631Z T=16384, 2025-05-07T20:31:44.6466826Z D=7168, 2025-05-07T20:31:44.6467021Z contiguous=False, 2025-05-07T20:31:44.6467244Z compiled=False, 2025-05-07T20:31:44.6467446Z ) 2025-05-07T20:31:44.6467641Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6468026Z self=, 2025-05-07T20:31:44.6468400Z T=16384, 2025-05-07T20:31:44.6468597Z D=7168, 2025-05-07T20:31:44.6468793Z contiguous=True, 2025-05-07T20:31:44.6469012Z compiled=False, 2025-05-07T20:31:44.6469217Z ) 2025-05-07T20:31:44.6469395Z PASSED 2025-05-07T20:31:44.7053068Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:44.7054191Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:44.7055625Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:44.7057132Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:44.7058138Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:44.7059493Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:44.7060946Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.7062331Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:44.7063623Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:44.7065063Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.7066176Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:31:44.7067523Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:44.7068823Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:44.7070100Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:44.7071365Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:44.7072233Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:44.7073456Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:44.7074519Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:44.7075351Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:44.7076617Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:44.7077959Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:44.7079135Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:44.7080224Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:44.7081459Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:44.7082880Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:44.7083991Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.7084941Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.7085834Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:44.7086898Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the same identify_mutated_tensors warning and CompilationError traceback repeated three more times, at W0507 20:31:44.720000, 20:31:44.759000, and 20:31:44.763000, each ending with:]
ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1797394Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1798383Z self=, 2025-05-07T20:31:45.1798894Z T=1, 2025-05-07T20:31:45.1799077Z D=5120, 2025-05-07T20:31:45.1799265Z scale_ub=None, 2025-05-07T20:31:45.1799480Z contiguous=True, 2025-05-07T20:31:45.1799711Z compiled=True, 2025-05-07T20:31:45.1799916Z ) 2025-05-07T20:31:45.1800250Z self = 2025-05-07T20:31:45.1800752Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.1801021Z 2025-05-07T20:31:45.1801110Z @given( 2025-05-07T20:31:45.1801341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.1801668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.1801997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.1802328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.1802673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.1802967Z ) 2025-05-07T20:31:45.1803321Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.1803782Z def test_silu_mul_quant( 2025-05-07T20:31:45.1804032Z self, 2025-05-07T20:31:45.1804232Z T: int, 2025-05-07T20:31:45.1804426Z D: int, 2025-05-07T20:31:45.1804648Z scale_ub: Optional[float], 2025-05-07T20:31:45.1804929Z contiguous: bool, 2025-05-07T20:31:45.1805167Z compiled: bool, 2025-05-07T20:31:45.1805409Z ) -> None: 2025-05-07T20:31:45.1805632Z torch.manual_seed(2025) 2025-05-07T20:31:45.1805874Z 2025-05-07T20:31:45.1806409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.1806807Z 2025-05-07T20:31:45.1807006Z x_sign = torch.sign(x) 2025-05-07T20:31:45.1807327Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.1807674Z x = x_sign * x_clamp 2025-05-07T20:31:45.1808267Z x0 = x[:, :D] 2025-05-07T20:31:45.1808493Z x1 = x[:, D:] 2025-05-07T20:31:45.1808703Z 2025-05-07T20:31:45.1808889Z if contiguous: 2025-05-07T20:31:45.1809124Z x0 = x0.contiguous() 2025-05-07T20:31:45.1809386Z x1 = x1.contiguous() 2025-05-07T20:31:45.1809620Z 2025-05-07T20:31:45.1809819Z if scale_ub is not None: 2025-05-07T20:31:45.1810093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.1810434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.1810744Z ) 2025-05-07T20:31:45.1810941Z else: 2025-05-07T20:31:45.1811152Z scale_ub_tensor = None 2025-05-07T20:31:45.1811399Z 2025-05-07T20:31:45.1811641Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.1819106Z op = silu_mul_quant 2025-05-07T20:31:45.1819414Z if compiled: 2025-05-07T20:31:45.1819676Z op = torch.compile(op) 2025-05-07T20:31:45.1819990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.1820268Z 2025-05-07T20:31:45.1820468Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.1820769Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.1821064Z 2025-05-07T20:31:45.1821315Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.1821663Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.1821962Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.1822287Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.1822659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.1822972Z 2025-05-07T20:31:45.1823180Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.1823600Z 2025-05-07T20:31:45.1823706Z moe/activation_test.py:126: 2025-05-07T20:31:45.1824019Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1824364Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.1824704Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.1825523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.1826295Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.1826864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.1827562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.1828267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.1829017Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.1829770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.1830434Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.1831051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.1831587Z fn() 2025-05-07T20:31:45.1832110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.1832702Z self.fn.run( 2025-05-07T20:31:45.1833186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.1833733Z kernel = self.compile( 2025-05-07T20:31:45.1834290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.1834963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.1835374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1835697Z 2025-05-07T20:31:45.1835918Z self = 2025-05-07T20:31:45.1837040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.1838478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4486d9fba0>} 2025-05-07T20:31:45.1839875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.1840948Z context = 2025-05-07T20:31:45.1841243Z 2025-05-07T20:31:45.1841426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.1841958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.1842437Z module_map=module_map) 2025-05-07T20:31:45.1842809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.1843174Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.1843441Z E ^ 2025-05-07T20:31:45.1843920Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1844387Z 2025-05-07T20:31:45.1844826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1845441Z 2025-05-07T20:31:45.1845551Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1845969Z self=, 2025-05-07T20:31:45.1846385Z T=2048, 2025-05-07T20:31:45.1846582Z D=5120, 2025-05-07T20:31:45.1846771Z scale_ub=1200.0, 2025-05-07T20:31:45.1847002Z contiguous=True, 2025-05-07T20:31:45.1847229Z compiled=False, 2025-05-07T20:31:45.1847434Z ) 2025-05-07T20:31:45.4723762Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:45.4725212Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:45.4726607Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:45.4728148Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:45.4729154Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4730506Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:45.4732027Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4733062Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4734697Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:45.4736138Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4737245Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4738579Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:45.4739895Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:45.4741171Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:45.4742429Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:45.4743291Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4744361Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:45.4745573Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:45.4746586Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:45.4747849Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:45.4749185Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:45.4750346Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:45.4751437Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:45.4752663Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:45.4754073Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:45.4755174Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4756118Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4756883Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:45.4758023Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1344517Z self = 2025-05-07T20:31:46.1345736Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.1346026Z 2025-05-07T20:31:46.1346110Z @given( 2025-05-07T20:31:46.1346350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1346688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1346995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1347341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1347679Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1347965Z ) 2025-05-07T20:31:46.1348328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1348783Z def test_silu_mul_quant( 2025-05-07T20:31:46.1349032Z self, 2025-05-07T20:31:46.1349226Z T: int, 2025-05-07T20:31:46.1349424Z D: int, 2025-05-07T20:31:46.1349645Z scale_ub: Optional[float], 2025-05-07T20:31:46.1349916Z contiguous: bool, 2025-05-07T20:31:46.1350177Z compiled: bool, 2025-05-07T20:31:46.1350410Z ) -> None: 2025-05-07T20:31:46.1350625Z torch.manual_seed(2025) 2025-05-07T20:31:46.1350872Z 2025-05-07T20:31:46.1351156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1351498Z 2025-05-07T20:31:46.1351695Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1351990Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1352300Z x = x_sign * x_clamp 2025-05-07T20:31:46.1352545Z x0 = x[:, :D] 2025-05-07T20:31:46.1352767Z x1 = x[:, D:] 2025-05-07T20:31:46.1352975Z 2025-05-07T20:31:46.1353166Z if contiguous: 2025-05-07T20:31:46.1353404Z x0 = x0.contiguous() 2025-05-07T20:31:46.1353661Z x1 = x1.contiguous() 2025-05-07T20:31:46.1353904Z 2025-05-07T20:31:46.1354098Z if scale_ub is not None: 2025-05-07T20:31:46.1354368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1354716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1355030Z ) 2025-05-07T20:31:46.1355228Z else: 2025-05-07T20:31:46.1355438Z scale_ub_tensor = None 2025-05-07T20:31:46.1355695Z 2025-05-07T20:31:46.1356093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1356415Z op = silu_mul_quant 2025-05-07T20:31:46.1356675Z if compiled: 2025-05-07T20:31:46.1356933Z op = torch.compile(op) 2025-05-07T20:31:46.1357232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1357514Z 2025-05-07T20:31:46.1357714Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.1357881Z 2025-05-07T20:31:46.1357983Z moe/activation_test.py:117: 2025-05-07T20:31:46.1358290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1358629Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.1358915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1359632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.1360347Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.1360905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1361602Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1362290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1362839Z kernel = self.compile( 2025-05-07T20:31:46.1363397Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1364066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1364477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1364799Z 2025-05-07T20:31:46.1365018Z self = 2025-05-07T20:31:46.1366139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1367569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44857e9620>} 2025-05-07T20:31:46.1368957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1370033Z context = 2025-05-07T20:31:46.1370333Z 2025-05-07T20:31:46.1370508Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1371054Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1371544Z module_map=module_map) 2025-05-07T20:31:46.1372007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1372368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.1372624Z E ^ 2025-05-07T20:31:46.1373101Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1373565Z 2025-05-07T20:31:46.1374000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1374528Z 2025-05-07T20:31:46.1374637Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1375057Z self=, 2025-05-07T20:31:46.1375471Z T=2048, 2025-05-07T20:31:46.1375676Z D=5120, 2025-05-07T20:31:46.1375864Z scale_ub=1200.0, 2025-05-07T20:31:46.1376089Z contiguous=True, 2025-05-07T20:31:46.1376312Z compiled=True, 2025-05-07T20:31:46.1376514Z ) 2025-05-07T20:31:46.1376929Z self = 2025-05-07T20:31:46.1377439Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.1377717Z 2025-05-07T20:31:46.1377793Z @given( 2025-05-07T20:31:46.1378028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1378343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1378654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1378981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1379313Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1379615Z ) 2025-05-07T20:31:46.1379964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1380425Z def test_silu_mul_quant( 2025-05-07T20:31:46.1380671Z self, 2025-05-07T20:31:46.1380864Z T: int, 2025-05-07T20:31:46.1381066Z D: int, 2025-05-07T20:31:46.1381294Z scale_ub: Optional[float], 2025-05-07T20:31:46.1381561Z contiguous: bool, 2025-05-07T20:31:46.1381809Z compiled: bool, 2025-05-07T20:31:46.1382036Z ) -> None: 2025-05-07T20:31:46.1382248Z torch.manual_seed(2025) 2025-05-07T20:31:46.1382496Z 2025-05-07T20:31:46.1382774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1383118Z 2025-05-07T20:31:46.1383315Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1383615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1383933Z x = x_sign * x_clamp 2025-05-07T20:31:46.1384169Z x0 = x[:, :D] 
2025-05-07T20:31:46.1384390Z x1 = x[:, D:] 2025-05-07T20:31:46.1384601Z 2025-05-07T20:31:46.1384869Z if contiguous: 2025-05-07T20:31:46.1385101Z x0 = x0.contiguous() 2025-05-07T20:31:46.1385361Z x1 = x1.contiguous() 2025-05-07T20:31:46.1385596Z 2025-05-07T20:31:46.1385790Z if scale_ub is not None: 2025-05-07T20:31:46.1386073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1386406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1386721Z ) 2025-05-07T20:31:46.1386917Z else: 2025-05-07T20:31:46.1387122Z scale_ub_tensor = None 2025-05-07T20:31:46.1387375Z 2025-05-07T20:31:46.1387609Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1387923Z op = silu_mul_quant 2025-05-07T20:31:46.1388181Z if compiled: 2025-05-07T20:31:46.1388433Z op = torch.compile(op) 2025-05-07T20:31:46.1388730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1389000Z 2025-05-07T20:31:46.1389192Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.1389485Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.1389774Z 2025-05-07T20:31:46.1390017Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1390360Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.1390653Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.1390977Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.1391348Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.1391658Z 2025-05-07T20:31:46.1391861Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.1392067Z 2025-05-07T20:31:46.1392170Z moe/activation_test.py:126: 2025-05-07T20:31:46.1392476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1392818Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.1393153Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.1393963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.1394739Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.1395381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1396089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1396802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.1397541Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.1398297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.1398958Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.1399581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.1400112Z fn() 2025-05-07T20:31:46.1400640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.1401243Z self.fn.run( 2025-05-07T20:31:46.1401716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1402262Z kernel = self.compile( 2025-05-07T20:31:46.1402818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1403495Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1403901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1404142Z 2025-05-07T20:31:46.1404353Z self = 2025-05-07T20:31:46.1405472Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1407374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448577e980>} 2025-05-07T20:31:46.1408772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1409835Z context = 2025-05-07T20:31:46.1410136Z 2025-05-07T20:31:46.1410313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1410851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1411330Z module_map=module_map) 2025-05-07T20:31:46.1411702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1412158Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.1412429Z E ^ 2025-05-07T20:31:46.1412908Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1413377Z 2025-05-07T20:31:46.1413806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1414338Z 2025-05-07T20:31:46.1414448Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1414866Z self=, 2025-05-07T20:31:46.1415279Z T=16384, 2025-05-07T20:31:46.1415469Z D=7168, 2025-05-07T20:31:46.1415658Z scale_ub=1200.0, 2025-05-07T20:31:46.1415884Z contiguous=False, 2025-05-07T20:31:46.1416122Z compiled=False, 2025-05-07T20:31:46.1416328Z ) 2025-05-07T20:31:46.3239807Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:46.3242047Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:46.3244826Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:46.3247777Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:46.3248988Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3250373Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:46.3251895Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3260720Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3262019Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:46.3263681Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3264789Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3266119Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:46.3267416Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:46.3268682Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:46.3269947Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:46.3270796Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3271857Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:46.3272915Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:46.3273743Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:46.3274998Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:46.3276450Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:46.3277616Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:46.3278699Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:46.3279923Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:46.3281329Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:46.3282435Z 
W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3283378Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3284144Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:46.3285186Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:47.3271399Z self = 2025-05-07T20:31:47.3272148Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:47.3272561Z 2025-05-07T20:31:47.3272646Z @given( 2025-05-07T20:31:47.3272887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.3273206Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.3273522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.3273890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.3274232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.3274524Z ) 2025-05-07T20:31:47.3275400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.3275977Z def test_silu_mul_quant( 2025-05-07T20:31:47.3276229Z self, 2025-05-07T20:31:47.3276440Z T: int, 2025-05-07T20:31:47.3276649Z D: int, 2025-05-07T20:31:47.3276872Z scale_ub: Optional[float], 2025-05-07T20:31:47.3277158Z contiguous: bool, 2025-05-07T20:31:47.3277415Z compiled: bool, 2025-05-07T20:31:47.3277649Z ) -> None: 2025-05-07T20:31:47.3277875Z torch.manual_seed(2025) 2025-05-07T20:31:47.3278133Z 2025-05-07T20:31:47.3278436Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.3278817Z 2025-05-07T20:31:47.3279022Z x_sign = torch.sign(x) 2025-05-07T20:31:47.3279325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.3279650Z x = x_sign * x_clamp 2025-05-07T20:31:47.3279901Z x0 = x[:, :D] 2025-05-07T20:31:47.3280129Z x1 = x[:, D:] 2025-05-07T20:31:47.3280339Z 2025-05-07T20:31:47.3280539Z if contiguous: 2025-05-07T20:31:47.3280811Z x0 = x0.contiguous() 2025-05-07T20:31:47.3281081Z x1 = x1.contiguous() 2025-05-07T20:31:47.3281320Z 2025-05-07T20:31:47.3281523Z if scale_ub is not None: 2025-05-07T20:31:47.3281806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.3282152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.3282466Z ) 2025-05-07T20:31:47.3282668Z else: 2025-05-07T20:31:47.3282888Z scale_ub_tensor = None 2025-05-07T20:31:47.3283145Z 2025-05-07T20:31:47.3283383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.3283713Z op = silu_mul_quant 2025-05-07T20:31:47.3284142Z if compiled: 2025-05-07T20:31:47.3284399Z op = torch.compile(op) 2025-05-07T20:31:47.3284704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3284983Z 2025-05-07T20:31:47.3285190Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.3285363Z 2025-05-07T20:31:47.3285474Z moe/activation_test.py:117: 2025-05-07T20:31:47.3285778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.3286122Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.3286413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3287141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.3287860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.3288427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.3289200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.3289890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.3290456Z kernel = self.compile( 2025-05-07T20:31:47.3291019Z 
2025-05-07T20:31:47.3291019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:47.3291704Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:47.3292205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:47.3292680Z self =
2025-05-07T20:31:47.3293812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:47.3295265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44849e1620>}
2025-05-07T20:31:47.3296751Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:47.3297812Z context =
2025-05-07T20:31:47.3298293Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:47.3298859Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:47.3299378Z                            module_map=module_map)
2025-05-07T20:31:47.3299751Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:47.3300120Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:47.3300396Z E       ^
2025-05-07T20:31:47.3300870Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:47.3301783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
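Every failure in this job reduces to one root cause: Triton's fp8e4nv (float8 e4m3) type is not available on this runner's GPU. The job ran on a linux.g5.4xlarge.nvidia.gpu instance, whose A10G reports compute capability 8.6, while Triton's fp8e4nv lowering is, to the best of my knowledge, only available on Ada/Hopper-class parts (compute capability 8.9 and up). A minimal sketch of a guard a test suite could use to skip these cases on unsupported hardware; the helper name and the 8.9 threshold are assumptions on my part, not FBGEMM or Triton API:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv lowering needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on this g5.4xlarge runner reports 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test class:
#
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...) -> None: ...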
2025-05-07T20:31:47.3302432Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:47.3302871Z     self=,
2025-05-07T20:31:47.3303286Z     T=1,
2025-05-07T20:31:47.3303480Z     D=7168,
2025-05-07T20:31:47.3303686Z     scale_ub=None,
2025-05-07T20:31:47.3303908Z     contiguous=True,
2025-05-07T20:31:47.3304141Z     compiled=True,
2025-05-07T20:31:47.3304355Z )
2025-05-07T20:31:47.3304682Z self =
2025-05-07T20:31:47.3305182Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
[test source identical to the listing above, up to and including the definition of fn(); this example gets past fn() and fails in the reference path:]
2025-05-07T20:31:47.3317711Z         y_fp8, y_scale = fn()
2025-05-07T20:31:47.3318006Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:47.3318314Z
2025-05-07T20:31:47.3318554Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:47.3318900Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:47.3319210Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:47.3319533Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:47.3319913Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:47.3320238Z
2025-05-07T20:31:47.3320444Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:47.3320757Z moe/activation_test.py:126:
2025-05-07T20:31:47.3321066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:47.3321416Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:47.3321748Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:47.3322568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:47.3323356Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:47.3323924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:47.3324624Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:47.3325467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:47.3326220Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:47.3326972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:47.3327638Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:47.3328262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:47.3328848Z     fn()
2025-05-07T20:31:47.3329370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:47.3329978Z     self.fn.run(
2025-05-07T20:31:47.3330462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:47.3331019Z     kernel = self.compile(
2025-05-07T20:31:47.3331571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:47.3332334Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:47.3332751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:47.3333205Z self =
2025-05-07T20:31:47.3334333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:47.3335764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4486da8b80>}
2025-05-07T20:31:47.3337258Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:47.3338322Z context =
2025-05-07T20:31:47.3338836Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:47.3339382Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:47.3339870Z                            module_map=module_map)
2025-05-07T20:31:47.3340242Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:47.3340615Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:47.3340894Z E       ^
2025-05-07T20:31:47.3341383Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:47.3342287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
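Note that this example got past fn() and failed in the test's own reference path: triton_quantize_fp8_row launches another Triton kernel (_kernel_quantize_fp8_row), which trips over the same fp8e4nv limitation. For checking the numerics on a GPU without fp8e4nv support, a rough pure-PyTorch stand-in for ref_fn could look like the following; the per-row scaling convention is inferred from the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]) and is an assumption, not FBGEMM's actual implementation:

from typing import Optional, Tuple

import torch


def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, then per-row quantization to float8 e4m3,
    # treating y_scale as the per-row dequantization scale.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    y_scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale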
2025-05-07T20:31:47.3342941Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:47.3343372Z     self=,
2025-05-07T20:31:47.3343793Z     T=4096,
2025-05-07T20:31:47.3343983Z     D=5120,
2025-05-07T20:31:47.3344182Z     scale_ub=None,
2025-05-07T20:31:47.3344407Z     contiguous=False,
2025-05-07T20:31:47.3344636Z     compiled=False,
2025-05-07T20:31:47.3344846Z )
[W0507 20:31:47.619000 through 20:31:48.118000 96677 triton_kernel_wrap.py:752: the identify_mutated_tensors warning traceback shown above under [1/2] is emitted four more times here with counter [1/3], each ending in the same fp8e4nv CompilationError for _fbgemm_silu_mul_quant]
[each example below reprints the full test source and traceback in the raw log; all fail with the identical triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails at moe/activation_test.py:117 in fn(), compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails at moe/activation_test.py:117 in fn(), compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails at moe/activation_test.py:126 in ref_fn(), compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row
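Hypothesis keeps drawing fresh parameter combinations even though every draw dies in the same compile step. When debugging a failure like this, pinning one combination with Hypothesis's @example decorator makes it replay deterministically before any random exploration; a sketch on a stand-in test, not the FBGEMM suite itself:

from hypothesis import example, given, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=16384, D=7168)  # one of the failing shapes from this log
def test_replays_failing_shape(T: int, D: int) -> None:
    # Placeholder body; the real test would launch silu_mul_quant here.
    assert T > 0 and D > 0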
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5413431Z 2025-05-07T20:31:49.5413866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5414394Z 2025-05-07T20:31:49.5414503Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5414927Z self=, 2025-05-07T20:31:49.5415341Z T=128, 2025-05-07T20:31:49.5415523Z D=7168, 2025-05-07T20:31:49.5415718Z scale_ub=None, 2025-05-07T20:31:49.5415946Z contiguous=False, 2025-05-07T20:31:49.5416176Z compiled=False, 2025-05-07T20:31:49.5416378Z ) 2025-05-07T20:31:49.7432555Z self = 2025-05-07T20:31:49.7433345Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.7433715Z 2025-05-07T20:31:49.7433795Z @given( 2025-05-07T20:31:49.7434033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.7434755Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.7435066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.7435403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.7435737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.7436024Z ) 2025-05-07T20:31:49.7436382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.7436839Z def test_silu_mul_quant( 2025-05-07T20:31:49.7437081Z self, 2025-05-07T20:31:49.7437282Z T: int, 2025-05-07T20:31:49.7437484Z D: int, 2025-05-07T20:31:49.7437699Z scale_ub: Optional[float], 2025-05-07T20:31:49.7437974Z contiguous: bool, 2025-05-07T20:31:49.7438220Z compiled: bool, 2025-05-07T20:31:49.7438445Z ) -> None: 2025-05-07T20:31:49.7438666Z torch.manual_seed(2025) 2025-05-07T20:31:49.7438913Z 2025-05-07T20:31:49.7439183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.7439586Z 2025-05-07T20:31:49.7439788Z x_sign = torch.sign(x) 2025-05-07T20:31:49.7440080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.7440399Z x = x_sign * x_clamp 2025-05-07T20:31:49.7440653Z x0 = x[:, :D] 2025-05-07T20:31:49.7440867Z x1 = x[:, D:] 2025-05-07T20:31:49.7441078Z 2025-05-07T20:31:49.7441265Z if contiguous: 2025-05-07T20:31:49.7441494Z x0 = x0.contiguous() 2025-05-07T20:31:49.7441756Z x1 = x1.contiguous() 2025-05-07T20:31:49.7442001Z 2025-05-07T20:31:49.7442185Z if scale_ub is not None: 2025-05-07T20:31:49.7442461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.7442806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.7443121Z ) 2025-05-07T20:31:49.7443311Z else: 2025-05-07T20:31:49.7443524Z scale_ub_tensor = None 2025-05-07T20:31:49.7443782Z 2025-05-07T20:31:49.7444016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.7444340Z op = silu_mul_quant 2025-05-07T20:31:49.7444600Z if compiled: 2025-05-07T20:31:49.7444847Z op = torch.compile(op) 2025-05-07T20:31:49.7445305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7445590Z 2025-05-07T20:31:49.7445780Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.7445981Z 2025-05-07T20:31:49.7446083Z moe/activation_test.py:117: 2025-05-07T20:31:49.7446393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.7454951Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.7455286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7456015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.7456730Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.7457281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.7457998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.7458694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.7459242Z kernel = self.compile( 2025-05-07T20:31:49.7459811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.7460494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.7460912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.7461152Z 2025-05-07T20:31:49.7461366Z self = 2025-05-07T20:31:49.7462494Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.7464058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477844cc0>} 2025-05-07T20:31:49.7465453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.7466521Z context = 2025-05-07T20:31:49.7466818Z 2025-05-07T20:31:49.7466991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.7467534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.7468017Z module_map=module_map) 2025-05-07T20:31:49.7468393Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.7468755Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.7469019Z E ^ 2025-05-07T20:31:49.7469505Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.7469981Z 2025-05-07T20:31:49.7470412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.7470944Z 2025-05-07T20:31:49.7471057Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.7471478Z self=, 2025-05-07T20:31:49.7471894Z T=4096, 2025-05-07T20:31:49.7472093Z D=5120, 2025-05-07T20:31:49.7472287Z scale_ub=1200.0, 2025-05-07T20:31:49.7472518Z contiguous=True, 2025-05-07T20:31:49.7472745Z compiled=False, 2025-05-07T20:31:49.7472953Z ) 2025-05-07T20:31:49.7473286Z self = 2025-05-07T20:31:49.7473807Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.7474091Z 2025-05-07T20:31:49.7474177Z @given( 2025-05-07T20:31:49.7474494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.7474819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.7475137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.7475476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.7475820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.7476115Z ) 2025-05-07T20:31:49.7476469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.7476925Z def test_silu_mul_quant( 2025-05-07T20:31:49.7477175Z self, 2025-05-07T20:31:49.7477374Z T: int, 2025-05-07T20:31:49.7477570Z D: int, 2025-05-07T20:31:49.7477794Z scale_ub: Optional[float], 2025-05-07T20:31:49.7478075Z contiguous: bool, 2025-05-07T20:31:49.7478313Z compiled: bool, 2025-05-07T20:31:49.7478549Z ) -> None: 2025-05-07T20:31:49.7478766Z torch.manual_seed(2025) 2025-05-07T20:31:49.7479006Z 2025-05-07T20:31:49.7479290Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.7479640Z 2025-05-07T20:31:49.7479828Z x_sign = torch.sign(x) 2025-05-07T20:31:49.7480131Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.7480445Z x = x_sign * x_clamp 2025-05-07T20:31:49.7480683Z x0 = x[:, :D] 2025-05-07T20:31:49.7480903Z x1 = x[:, D:] 2025-05-07T20:31:49.7481113Z 2025-05-07T20:31:49.7481296Z if contiguous: 2025-05-07T20:31:49.7481530Z x0 = x0.contiguous() 2025-05-07T20:31:49.7481796Z x1 = x1.contiguous() 2025-05-07T20:31:49.7482031Z 2025-05-07T20:31:49.7482226Z if scale_ub is not None: 2025-05-07T20:31:49.7482593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.7482930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.7483239Z ) 2025-05-07T20:31:49.7483433Z else: 2025-05-07T20:31:49.7483655Z scale_ub_tensor = None 2025-05-07T20:31:49.7483904Z 2025-05-07T20:31:49.7484140Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.7484465Z op = silu_mul_quant 2025-05-07T20:31:49.7484717Z if compiled: 2025-05-07T20:31:49.7484968Z op = torch.compile(op) 2025-05-07T20:31:49.7485274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7485547Z 2025-05-07T20:31:49.7485746Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.7485915Z 2025-05-07T20:31:49.7486024Z moe/activation_test.py:117: 2025-05-07T20:31:49.7486327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.7486674Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.7486968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7487688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.7488403Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:49.7488963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:49.7489721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.7490412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.7490960Z kernel = self.compile(
2025-05-07T20:31:49.7491520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.7492290Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.7492701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:49.7492941Z
2025-05-07T20:31:49.7493153Z self =
2025-05-07T20:31:49.7494363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.7495793Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477844e00>}
2025-05-07T20:31:49.7497192Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.7498253Z context =
2025-05-07T20:31:49.7498563Z
2025-05-07T20:31:49.7498733Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.7499296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.7499814Z module_map=module_map)
2025-05-07T20:31:49.7500180Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.7500544Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:49.7500810Z E ^
2025-05-07T20:31:49.7501285Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
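Note: every failing example above stops at the same point, before the kernel ever runs. The kernel requests the fp8e4nv element type (Triton's name for torch.float8_e4m3fn), which Triton only compiles on NVIDIA GPUs with compute capability 8.9 or newer; the g5.4xlarge runner carries an A10G, which reports capability (8, 6) and therefore only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of the kind of capability guard that would explain this log; the helper name is ours, not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; Triton can only lower it
        # on compute capability >= (8, 9) (Ada/Hopper). The A10G on this
        # runner reports (8, 6), so the guard returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Under that assumption, both _fbgemm_silu_mul_quant and the quantization kernel seen later in this log fail to compile on this runner regardless of which Hypothesis example is tried.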
2025-05-07T20:31:49.7501761Z
2025-05-07T20:31:49.7502194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.7502731Z
2025-05-07T20:31:49.7502834Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:49.7503259Z self=,
2025-05-07T20:31:49.7504196Z T=1,
2025-05-07T20:31:49.7504386Z D=5120,
2025-05-07T20:31:49.7504581Z scale_ub=None,
2025-05-07T20:31:49.7504792Z contiguous=True,
2025-05-07T20:31:49.7505026Z compiled=True,
2025-05-07T20:31:49.7505234Z )
2025-05-07T20:31:49.9859106Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:49.9860456Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:31:49.9861860Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:49.9863369Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:49.9864419Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:49.9865792Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:49.9867252Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.9868290Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:49.9869856Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:49.9871586Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.9872914Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:49.9874517Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:49.9876073Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse())
2025-05-07T20:31:49.9877609Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.9879110Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:49.9880127Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.9881398Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:49.9882663Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:49.9883788Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:49.9885308Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.9886908Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.9888296Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:49.9889587Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:49.9891064Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.9892705Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.9893826Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9894796Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9895580Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:49.9896659Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The identify_mutated_tensors warning above repeats verbatim three more times (W0507 20:31:50.054000, 20:31:50.261000, 20:31:50.272000); only the tail of the last repetition is kept below.]
2025-05-07T20:31:50.2786334Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.2787274Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:50.2788041Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
2025-05-07T20:31:50.2789099Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.5000885Z self = 2025-05-07T20:31:50.5001654Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:50.5002028Z 2025-05-07T20:31:50.5002139Z @given( 2025-05-07T20:31:50.5002723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.5003051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.5003359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.5003713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.5004057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.5004344Z ) 2025-05-07T20:31:50.5004707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.5005173Z def test_silu_mul_quant( 2025-05-07T20:31:50.5005423Z self, 2025-05-07T20:31:50.5005631Z T: int, 2025-05-07T20:31:50.5005841Z D: int, 2025-05-07T20:31:50.5006066Z scale_ub: Optional[float], 2025-05-07T20:31:50.5006613Z contiguous: bool, 2025-05-07T20:31:50.5006866Z compiled: bool, 2025-05-07T20:31:50.5007100Z ) -> None: 2025-05-07T20:31:50.5007332Z torch.manual_seed(2025) 2025-05-07T20:31:50.5007591Z 2025-05-07T20:31:50.5007873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.5008224Z 2025-05-07T20:31:50.5008427Z x_sign = torch.sign(x) 2025-05-07T20:31:50.5008734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.5009049Z x = x_sign * x_clamp 2025-05-07T20:31:50.5009297Z x0 = x[:, :D] 2025-05-07T20:31:50.5009522Z x1 = x[:, D:] 2025-05-07T20:31:50.5009732Z 2025-05-07T20:31:50.5009925Z if contiguous: 2025-05-07T20:31:50.5010165Z x0 = x0.contiguous() 2025-05-07T20:31:50.5010431Z x1 = x1.contiguous() 2025-05-07T20:31:50.5010680Z 2025-05-07T20:31:50.5010886Z if scale_ub is not None: 2025-05-07T20:31:50.5011162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.5011510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.5011903Z ) 2025-05-07T20:31:50.5012102Z else: 2025-05-07T20:31:50.5012325Z scale_ub_tensor = None 2025-05-07T20:31:50.5012587Z 2025-05-07T20:31:50.5012821Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.5013148Z op = silu_mul_quant 2025-05-07T20:31:50.5013628Z if compiled: 2025-05-07T20:31:50.5013905Z op = torch.compile(op) 2025-05-07T20:31:50.5014231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.5014543Z 2025-05-07T20:31:50.5014751Z y_fp8, y_scale = fn() 2025-05-07T20:31:50.5015061Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:50.5015389Z 2025-05-07T20:31:50.5015650Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.5016029Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:50.5016362Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:50.5016719Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:50.5017123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.5017485Z 2025-05-07T20:31:50.5017708Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:50.5017930Z 2025-05-07T20:31:50.5018048Z moe/activation_test.py:126: 2025-05-07T20:31:50.5018386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.5018779Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:50.5019151Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.5020157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:50.5021082Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:50.5021734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:50.5022562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:50.5023392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:50.5024397Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:50.5025295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:50.5026068Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:50.5026782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:50.5027400Z fn()
2025-05-07T20:31:50.5028003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:50.5028699Z self.fn.run(
2025-05-07T20:31:50.5029248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:50.5029934Z kernel = self.compile(
2025-05-07T20:31:50.5030586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:50.5031367Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:50.5031831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.5032101Z
2025-05-07T20:31:50.5032343Z self =
2025-05-07T20:31:50.5033676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:50.5035450Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477846fc0>}
2025-05-07T20:31:50.5036851Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:50.5038034Z context =
2025-05-07T20:31:50.5038338Z
2025-05-07T20:31:50.5038520Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:50.5039065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.5039566Z module_map=module_map)
2025-05-07T20:31:50.5040002Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.5040370Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:50.5040651Z E ^
2025-05-07T20:31:50.5041143Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
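Note: the reference path fails for the same reason, one level deeper. triton_quantize_fp8_row launches _kernel_quantize_fp8_row through Triton's autotuner, so the CompilationError surfaces from do_bench while candidate configs are being benchmarked, rather than from the direct jit.py call seen earlier. Since both the op under test and the reference quantizer target fp8e4nv, the test as a whole can only pass on SM 8.9+ hardware. A hedged sketch of how such a test is typically gated, reusing the capability helper sketched above; the class name and decorator placement are illustrative, not the actual test file:

    import unittest
    import torch

    def _sm89_or_newer() -> bool:
        # Same capability check as in the earlier sketch.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTest(unittest.TestCase):  # stand-in for the real test class
        @unittest.skipIf(not _sm89_or_newer(), "fp8e4nv (float8_e4m3fn) requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the log

With such a guard, runners like this one would report the test as skipped instead of repeatedly compiling and failing for every Hypothesis example.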
2025-05-07T20:31:50.5041622Z
2025-05-07T20:31:50.5042069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:50.5042624Z
2025-05-07T20:31:50.5042730Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:50.5043175Z self=,
2025-05-07T20:31:50.5043601Z T=2048,
2025-05-07T20:31:50.5043802Z D=5120,
2025-05-07T20:31:50.5043999Z scale_ub=None,
2025-05-07T20:31:50.5044232Z contiguous=True,
2025-05-07T20:31:50.5044468Z compiled=True,
2025-05-07T20:31:50.5044680Z )
2025-05-07T20:31:50.7328816Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:50.7330782Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last):
2025-05-07T20:31:50.7333678Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:50.7337163Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:50.7339164Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:50.7340714Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:50.7342151Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:50.7343184Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:50.7344453Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:50.7345877Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.7346980Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:50.7348310Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:50.7349757Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse())
2025-05-07T20:31:50.7351036Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.7352292Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:50.7353158Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.7354227Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:50.7355303Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:50.7356126Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:50.7357388Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.7358723Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.7359885Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:50.7361093Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:50.7362335Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.7363758Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.7364874Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.7365825Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.7374077Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:50.7375184Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The identify_mutated_tensors warning above repeats verbatim three more times (W0507 20:31:50.800000, 20:31:51.007000, 20:31:51.018000); only the tail of the last repetition is kept below.]
2025-05-07T20:31:51.0240193Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:51.0241146Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:51.0241925Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^
2025-05-07T20:31:51.0242992Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.2336784Z self = 2025-05-07T20:31:51.2337314Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:51.2337620Z 2025-05-07T20:31:51.2337711Z @given( 2025-05-07T20:31:51.2337959Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:51.2338279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:51.2338587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:51.2338924Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:51.2339257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:51.2339537Z ) 2025-05-07T20:31:51.2339891Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:51.2340349Z def test_silu_mul_quant( 2025-05-07T20:31:51.2340592Z self, 2025-05-07T20:31:51.2340790Z T: int, 2025-05-07T20:31:51.2340997Z D: int, 2025-05-07T20:31:51.2341392Z scale_ub: Optional[float], 2025-05-07T20:31:51.2341657Z contiguous: bool, 2025-05-07T20:31:51.2341895Z compiled: bool, 2025-05-07T20:31:51.2342131Z ) -> None: 2025-05-07T20:31:51.2342344Z torch.manual_seed(2025) 2025-05-07T20:31:51.2342594Z 2025-05-07T20:31:51.2342873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:51.2343215Z 2025-05-07T20:31:51.2343409Z x_sign = torch.sign(x) 2025-05-07T20:31:51.2343705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:51.2344014Z x = x_sign * x_clamp 2025-05-07T20:31:51.2344257Z x0 = x[:, :D] 2025-05-07T20:31:51.2344476Z x1 = x[:, D:] 2025-05-07T20:31:51.2344682Z 2025-05-07T20:31:51.2344868Z if contiguous: 2025-05-07T20:31:51.2345103Z x0 = x0.contiguous() 2025-05-07T20:31:51.2345361Z x1 = x1.contiguous() 2025-05-07T20:31:51.2345607Z 2025-05-07T20:31:51.2345800Z if scale_ub is not None: 2025-05-07T20:31:51.2346084Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:51.2346421Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:51.2346768Z ) 2025-05-07T20:31:51.2346967Z else: 2025-05-07T20:31:51.2347187Z scale_ub_tensor = None 2025-05-07T20:31:51.2347440Z 2025-05-07T20:31:51.2347680Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.2348004Z op = silu_mul_quant 2025-05-07T20:31:51.2348252Z if compiled: 2025-05-07T20:31:51.2348502Z op = torch.compile(op) 2025-05-07T20:31:51.2348807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.2349084Z 2025-05-07T20:31:51.2349283Z y_fp8, y_scale = fn() 2025-05-07T20:31:51.2349576Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:51.2349866Z 2025-05-07T20:31:51.2350135Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.2350511Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:51.2350805Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:51.2351129Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:51.2351680Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.2352001Z 2025-05-07T20:31:51.2352203Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:51.2352406Z 2025-05-07T20:31:51.2352507Z moe/activation_test.py:126: 2025-05-07T20:31:51.2352811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.2353151Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:51.2353488Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.2354308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:51.2355088Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:51.2355652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:51.2356356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:51.2357075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:51.2357816Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:51.2358572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:51.2359231Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:51.2359853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:51.2360383Z fn() 2025-05-07T20:31:51.2360904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:51.2361590Z self.fn.run( 2025-05-07T20:31:51.2362074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:51.2362621Z kernel = self.compile( 2025-05-07T20:31:51.2363177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:51.2363853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.2364255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.2364497Z 2025-05-07T20:31:51.2364708Z self = 2025-05-07T20:31:51.2365833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:51.2367268Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44777f54e0>} 2025-05-07T20:31:51.2368671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:51.2369727Z context = 2025-05-07T20:31:51.2370032Z 2025-05-07T20:31:51.2370242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:51.2370799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.2371282Z module_map=module_map) 2025-05-07T20:31:51.2371655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.2372094Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:51.2372375Z E ^ 2025-05-07T20:31:51.2372857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:51.2373841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:51.2374480Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:51.2374911Z self=,
2025-05-07T20:31:51.2375330Z T=128,
2025-05-07T20:31:51.2375517Z D=5120,
2025-05-07T20:31:51.2375717Z scale_ub=None,
2025-05-07T20:31:51.2375940Z contiguous=True,
2025-05-07T20:31:51.2376162Z compiled=True,
2025-05-07T20:31:51.2376372Z )
[the W0507 "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" warning and its identical _fbgemm_silu_mul_quant CompilationError traceback are emitted 4 more times here, tagged [1/6]]
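Root-cause note: Triton's fp8e4nv is the NVIDIA e4m3 float8 type (torch.float8_e4m3fn), which Triton only lowers on GPUs with compute capability 8.9 or newer; the supported list in the error, ('fp8e4b15', 'fp8e5'), is what pre-SM89 Ampere-class hardware offers, so every compile of _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row on this GPU fails the same way. A minimal guard sketch, not code from this repository (class name is illustrative), that would skip these tests on such hardware:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv == torch.float8_e4m3fn; Triton lowers it only on SM 8.9+.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):
        ...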
2025-05-07T20:31:52.0155339Z self = 
2025-05-07T20:31:52.0156438Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test body and failure identical to the T = 2048 example above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row fails to compile]
2025-05-07T20:31:52.0200397Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:52.0200804Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:52.0201077Z E ^
2025-05-07T20:31:52.0201555Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0202449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
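Note that the exception is raised inside ref_fn, the eager reference path, so it does not depend on torch.compile; compiled=False draws would fail identically. A hedged repro sketch, using only the import path and call shape already visible in the traceback:

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On a pre-SM89 GPU this raises the CompilationError above while Triton
    # compiles _kernel_quantize_fp8_row; no torch.compile is involved.
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)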
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.0202019Z 2025-05-07T20:31:52.0202449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:52.0202986Z 2025-05-07T20:31:52.0203091Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:52.0203519Z self=, 2025-05-07T20:31:52.0203937Z T=4096, 2025-05-07T20:31:52.0204125Z D=5120, 2025-05-07T20:31:52.0204328Z scale_ub=None, 2025-05-07T20:31:52.0204551Z contiguous=True, 2025-05-07T20:31:52.0204772Z compiled=True, 2025-05-07T20:31:52.0204983Z ) 2025-05-07T20:31:52.2631617Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.2632733Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.2634115Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.2635586Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.2636595Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2637958Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.2639405Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.2640423Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2641700Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.2643268Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.2644378Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2645718Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.2647014Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.2648287Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.2649562Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.2650478Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2651545Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.2652667Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.2653497Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.2654847Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.2656188Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.2657359Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.2658443Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.2659677Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.2661155Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.2662268Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.2663210Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.2663982Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.2665043Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.3344735Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.3346001Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.3347377Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.3348855Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.3349867Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3351279Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.3352714Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.3353731Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3354996Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.3356431Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.3357684Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3359014Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.3360360Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.3361623Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.3362892Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.3363758Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3364825Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.3365879Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.3366702Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.3367963Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.3369414Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.3370625Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.3371702Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.3373009Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.3374426Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.3375539Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.3376484Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.3377256Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.3378312Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.5438035Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.5439164Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.5441071Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.5444037Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.5446060Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5448770Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.5450891Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.5451983Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5453268Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.5454711Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.5455959Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5457306Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.5458614Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.5459897Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.5461216Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.5462087Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5463169Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.5464238Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.5465072Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.5466339Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.5467680Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.5468934Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.5470035Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.5471272Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.5472699Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.5473815Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.5474778Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.5475555Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.5476626Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.5534819Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.5535930Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.5537486Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.5538972Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.5539988Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5541398Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.5542853Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.5543889Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5545175Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.5546619Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.5547733Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5549190Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.5550509Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.5551839Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.5553108Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.5553977Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5555060Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.5556130Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.5556963Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.5558234Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.5559575Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.5560885Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.5561983Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.5563225Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.5564653Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.5565765Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.5566728Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.5567508Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.5568579Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:52.8091874Z self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:52.8129879Z Trying example: test_silu_mul_quant(self=..., T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:52.8431044Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:52.8433648Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:52.8436442Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:52.8438522Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:52.8440888Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[... test body and traceback identical to the first failure above: CompilationError ("type fp8e4nv not supported in this architecture") while compiling _kernel_quantize_fp8_row from ref_fn ...]
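The recompile limit is hit because the hypothesis examples alternate the `contiguous` flag, and `x0 = x[:, :D]` has a different row stride depending on whether `.contiguous()` was applied; Dynamo guards compiled code on input strides. A minimal illustration of exactly that guard mismatch (CPU tensors for convenience, sizes taken from the failing example; not FBGEMM code):

    import torch

    T, D = 128, 5120
    x = torch.randn(T, 2 * D)

    x0_view = x[:, :D]               # a view into x: row stride is 2 * D
    x0_dense = x0_view.contiguous()  # a fresh dense copy: row stride is D

    print(x0_view.stride())   # (10240, 1) -> the "actual 10240" in the guard failure
    print(x0_dense.stride())  # (5120, 1)  -> the "expected 5120"

    # torch.compile specializes on strides, so alternating these two layouts
    # (as the contiguous=True/False examples do) forces a recompile each time
    # until torch._dynamo hits recompile_limit (8).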
2025-05-07T20:31:52.9361140Z Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() via torch.compile ...]
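Each "Trying example" line is hypothesis echoing the drawn parameters before running them, a consequence of `verbosity=Verbosity.verbose` in the `@settings` above; the examples are drawn from the Cartesian space of the `sampled_from` strategies. A standalone toy with the same mechanics (not the FBGEMM test):

    from hypothesis import Verbosity, given, settings, strategies as st

    @settings(verbosity=Verbosity.verbose, max_examples=8, deadline=None)
    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    def test_demo(T: int, contiguous: bool) -> None:
        # With Verbosity.verbose, each attempt is printed as
        # "Trying example: test_demo(T=..., contiguous=...)" before it runs.
        assert T > 0

    test_demo()  # a @given-wrapped test is directly callable and runs the search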
2025-05-07T20:31:53.0769853Z Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[... same failure: fn() returned, then CompilationError while compiling _kernel_quantize_fp8_row from ref_fn ...]
2025-05-07T20:31:53.1417398Z Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
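Note that the compiled=False examples fail identically: Triton kernels are JIT-compiled at first launch, so the eager path reaches the same fp8e4nv codegen error as the torch.compile path. A minimal sketch of that launch-time compilation (a generic copy kernel, assuming a CUDA machine with triton installed; not the FBGEMM kernel):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    # Compilation happens here, at the first [grid](...) launch -- eager or
    # compiled alike -- which is why compiled=False hits the same CompilationError.
    _copy_kernel[(triton.cdiv(x.numel(), 256),)](x, y, x.numel(), BLOCK=256)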
2025-05-07T20:31:53.2939898Z Trying example: test_silu_mul_quant(self=..., T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() via torch.compile ...]
2025-05-07T20:31:53.2973107Z Trying example: test_silu_mul_quant(self=..., T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
2025-05-07T20:31:53.4138376Z Trying example: test_silu_mul_quant(self=..., T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
2025-05-07T20:31:53.4170010Z Trying example: test_silu_mul_quant(self=..., T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
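For reference, the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row returns a per-row scale such that y is approximately y_fp8 * scale. A pure-PyTorch sketch of that row-wise scheme (assumed semantics for illustration only, not FBGEMM's kernel; requires a PyTorch build with float8 dtypes):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise fp8 quantization, assumed semantics: scale each row so its
        # max |value| maps onto the e4m3 representable maximum (448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)  # avoid div-by-zero
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max / fp8_max
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize with y_fp8.float() * scale[:, None]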
2025-05-07T20:31:53.8011707Z Trying example: test_silu_mul_quant(self=..., T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() via torch.compile ...]
2025-05-07T20:31:53.8029660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.8030376Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.8031025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.8031730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.8032426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.8032978Z kernel = self.compile( 2025-05-07T20:31:53.8033540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.8034216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.8034628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.8034864Z 2025-05-07T20:31:53.8035083Z self = 2025-05-07T20:31:53.8036207Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.8037644Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e7aca0>} 2025-05-07T20:31:53.8039051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.8040118Z context = 2025-05-07T20:31:53.8040414Z 2025-05-07T20:31:53.8040591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.8041126Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.8041612Z module_map=module_map) 2025-05-07T20:31:53.8041994Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.8042360Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.8042626Z E ^ 2025-05-07T20:31:53.8043188Z E ValueError("type fp8e4nv not supported in this architecture. 
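Note on the failure: on this Triton build, 'fp8e4nv' is the NVIDIA e4m3 float8 (torch.float8_e4m3fn), and Triton can only lower it on GPUs with compute capability 8.9 or newer (Ada/Hopper); older parts are offered only 'fp8e4b15' and 'fp8e5' (e5m2), which is exactly what the ValueError reports. A minimal sketch of a capability gate a caller could apply before choosing an fp8 dtype (pick_fp8_dtype is a hypothetical helper, not an FBGEMM API):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # fp8e4nv (torch.float8_e4m3fn) only compiles on compute capability
    # >= 8.9 (Ada/Hopper); earlier GPUs fall back to e5m2 ('fp8e5').
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2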
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) with the same CompilationError from _fbgemm_silu_mul_quant.

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  Here fn() itself returns, and the failure moves one step later, into the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
    (same Triton compile frames as above, with num_stages=2 in CUDAOptions)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
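For reference, the dequant step in the test (y_fp8.to(torch.float32) * y_scale[:, None]) pins down what triton_quantize_fp8_row must produce: one scale per row, mapping that row's max |value| (optionally clamped by scale_ub) onto the fp8 range. A plain-PyTorch sketch of those assumed semantics (quantize_fp8_row_ref is our name; the real kernel lives in fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py):

import torch
from typing import Optional, Tuple

def quantize_fp8_row_ref(
    y: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    dtype: torch.dtype = torch.float8_e5m2,  # arch-safe stand-in for e4m3
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: per-row scale so the row's max |value|
    # (optionally clamped by scale_ub) lands on the fp8 max, then cast.
    fp8_max = torch.finfo(dtype).max
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(dtype)
    return y_fp8, y_scale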
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.0370912Z 2025-05-07T20:31:54.0371347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.0371951Z 2025-05-07T20:31:54.0372067Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.0372494Z self=, 2025-05-07T20:31:54.0372916Z T=1, 2025-05-07T20:31:54.0373111Z D=5120, 2025-05-07T20:31:54.0373305Z scale_ub=1200.0, 2025-05-07T20:31:54.0373544Z contiguous=False, 2025-05-07T20:31:54.0373783Z compiled=True, 2025-05-07T20:31:54.0373991Z ) 2025-05-07T20:31:54.1878171Z self = 2025-05-07T20:31:54.1878926Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:54.1879316Z 2025-05-07T20:31:54.1879437Z @given( 2025-05-07T20:31:54.1879689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.1880009Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.1880316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.1880670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.1881030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.1881346Z ) 2025-05-07T20:31:54.1881698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.1882139Z def test_silu_mul_quant( 2025-05-07T20:31:54.1882736Z self, 2025-05-07T20:31:54.1882936Z T: int, 2025-05-07T20:31:54.1883125Z D: int, 2025-05-07T20:31:54.1883339Z scale_ub: Optional[float], 2025-05-07T20:31:54.1883607Z contiguous: bool, 2025-05-07T20:31:54.1883838Z compiled: bool, 2025-05-07T20:31:54.1884074Z ) -> None: 2025-05-07T20:31:54.1884288Z torch.manual_seed(2025) 2025-05-07T20:31:54.1884524Z 2025-05-07T20:31:54.1884801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.1885144Z 2025-05-07T20:31:54.1885338Z x_sign = torch.sign(x) 2025-05-07T20:31:54.1885632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.1885975Z x = x_sign * x_clamp 2025-05-07T20:31:54.1886229Z x0 = x[:, :D] 2025-05-07T20:31:54.1886451Z x1 = x[:, D:] 2025-05-07T20:31:54.1886656Z 2025-05-07T20:31:54.1886849Z if contiguous: 2025-05-07T20:31:54.1887089Z x0 = x0.contiguous() 2025-05-07T20:31:54.1887358Z x1 = x1.contiguous() 2025-05-07T20:31:54.1887604Z 2025-05-07T20:31:54.1887803Z if scale_ub is not None: 2025-05-07T20:31:54.1888076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.1888417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.1888733Z ) 2025-05-07T20:31:54.1888930Z else: 2025-05-07T20:31:54.1889142Z scale_ub_tensor = None 2025-05-07T20:31:54.1889399Z 2025-05-07T20:31:54.1889639Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.1889959Z op = silu_mul_quant 2025-05-07T20:31:54.1890225Z if compiled: 2025-05-07T20:31:54.1890575Z op = torch.compile(op) 2025-05-07T20:31:54.1891046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1891327Z 2025-05-07T20:31:54.1891530Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.1891699Z 2025-05-07T20:31:54.1891923Z moe/activation_test.py:117: 2025-05-07T20:31:54.1892232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1892576Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.1892866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1893436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.1894015Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.1894701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.1895413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.1895960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.1896675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.1897364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.1897906Z kernel = self.compile( 2025-05-07T20:31:54.1898465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.1899144Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.1899558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1899794Z 2025-05-07T20:31:54.1900004Z self = 2025-05-07T20:31:54.1901172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.1902732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44767672e0>} 2025-05-07T20:31:54.1904131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.1905182Z context = 2025-05-07T20:31:54.1905485Z 2025-05-07T20:31:54.1905653Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.1907263Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.1907754Z module_map=module_map) 2025-05-07T20:31:54.1908122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.1908493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.1908757Z E ^ 2025-05-07T20:31:54.1909235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.1909708Z 2025-05-07T20:31:54.1910139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.1910680Z 2025-05-07T20:31:54.1910784Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.1911209Z self=, 2025-05-07T20:31:54.1911619Z T=1, 2025-05-07T20:31:54.1911808Z D=5120, 2025-05-07T20:31:54.1912012Z scale_ub=1200.0, 2025-05-07T20:31:54.1912238Z contiguous=False, 2025-05-07T20:31:54.1912472Z compiled=False, 2025-05-07T20:31:54.1912687Z ) 2025-05-07T20:31:54.1913009Z self = 2025-05-07T20:31:54.1913651Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:54.1913930Z 2025-05-07T20:31:54.1914009Z @given( 2025-05-07T20:31:54.1914248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.1914568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.1914883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.1915223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.1915557Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.1915852Z ) 2025-05-07T20:31:54.1916213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.1916667Z def test_silu_mul_quant( 2025-05-07T20:31:54.1916911Z self, 2025-05-07T20:31:54.1917112Z T: int, 2025-05-07T20:31:54.1917317Z D: int, 2025-05-07T20:31:54.1917537Z scale_ub: Optional[float], 2025-05-07T20:31:54.1917813Z contiguous: bool, 2025-05-07T20:31:54.1918068Z compiled: bool, 2025-05-07T20:31:54.1918291Z ) -> None: 2025-05-07T20:31:54.1918511Z torch.manual_seed(2025) 2025-05-07T20:31:54.1918758Z 2025-05-07T20:31:54.1919033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.1919385Z 2025-05-07T20:31:54.1919586Z x_sign = torch.sign(x) 2025-05-07T20:31:54.1919879Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.1920195Z x = x_sign * x_clamp 2025-05-07T20:31:54.1920448Z x0 = x[:, :D] 2025-05-07T20:31:54.1920665Z x1 = x[:, D:] 2025-05-07T20:31:54.1920875Z 2025-05-07T20:31:54.1921066Z if contiguous: 2025-05-07T20:31:54.1921299Z x0 = x0.contiguous() 2025-05-07T20:31:54.1921564Z x1 = x1.contiguous() 2025-05-07T20:31:54.1921810Z 2025-05-07T20:31:54.1921999Z if scale_ub is not None: 2025-05-07T20:31:54.1922278Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.1922624Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.1922942Z ) 2025-05-07T20:31:54.1923136Z else: 2025-05-07T20:31:54.1923350Z scale_ub_tensor = None 2025-05-07T20:31:54.1923744Z 2025-05-07T20:31:54.1923978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.1924304Z op = silu_mul_quant 2025-05-07T20:31:54.1924561Z if compiled: 2025-05-07T20:31:54.1924806Z op = torch.compile(op) 2025-05-07T20:31:54.1925111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1925394Z 2025-05-07T20:31:54.1925587Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.1925761Z 2025-05-07T20:31:54.1925861Z moe/activation_test.py:117: 2025-05-07T20:31:54.1926168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1926508Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.1926790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1927506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.1928217Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.1928770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.1929478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.1930165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.1930772Z kernel = self.compile( 2025-05-07T20:31:54.1931328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.1932104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.1932513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1932839Z 2025-05-07T20:31:54.1933050Z self = 2025-05-07T20:31:54.1934176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.1935600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44762d6ca0>} 2025-05-07T20:31:54.1936989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.1938052Z context = 2025-05-07T20:31:54.1938351Z 2025-05-07T20:31:54.1938529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.1939069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.1939557Z module_map=module_map) 2025-05-07T20:31:54.1939930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.1940286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.1940554Z E ^ 2025-05-07T20:31:54.1941035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.1941500Z 2025-05-07T20:31:54.1941928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.1942467Z 2025-05-07T20:31:54.1942573Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.1943001Z self=, 2025-05-07T20:31:54.1943424Z T=16384, 2025-05-07T20:31:54.1943619Z D=5120, 2025-05-07T20:31:54.1943820Z scale_ub=1200.0, 2025-05-07T20:31:54.1944053Z contiguous=False, 2025-05-07T20:31:54.1944279Z compiled=True, 2025-05-07T20:31:54.1944491Z ) 2025-05-07T20:31:54.2815170Z self = 2025-05-07T20:31:54.2815746Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:54.2816041Z 2025-05-07T20:31:54.2816122Z @given( 2025-05-07T20:31:54.2816351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.2816666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.2816974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.2817303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.2817632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.2817919Z ) 2025-05-07T20:31:54.2818265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.2818725Z def test_silu_mul_quant( 2025-05-07T20:31:54.2818968Z self, 2025-05-07T20:31:54.2819160Z T: int, 2025-05-07T20:31:54.2819355Z D: int, 2025-05-07T20:31:54.2819580Z scale_ub: Optional[float], 2025-05-07T20:31:54.2819845Z contiguous: bool, 2025-05-07T20:31:54.2820083Z compiled: bool, 2025-05-07T20:31:54.2820317Z ) -> None: 2025-05-07T20:31:54.2820532Z torch.manual_seed(2025) 2025-05-07T20:31:54.2820778Z 2025-05-07T20:31:54.2821056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.2821408Z 2025-05-07T20:31:54.2821599Z x_sign = torch.sign(x) 2025-05-07T20:31:54.2821896Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.2822215Z x = x_sign * x_clamp 2025-05-07T20:31:54.2822455Z x0 = x[:, :D] 2025-05-07T20:31:54.2822681Z x1 = x[:, D:] 2025-05-07T20:31:54.2822891Z 2025-05-07T20:31:54.2823226Z if contiguous: 2025-05-07T20:31:54.2823463Z x0 = x0.contiguous() 2025-05-07T20:31:54.2823724Z x1 = x1.contiguous() 2025-05-07T20:31:54.2823962Z 2025-05-07T20:31:54.2824163Z if scale_ub is not None: 2025-05-07T20:31:54.2824441Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.2824781Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.2825103Z ) 2025-05-07T20:31:54.2825303Z else: 2025-05-07T20:31:54.2825515Z scale_ub_tensor = None 2025-05-07T20:31:54.2825772Z 2025-05-07T20:31:54.2826011Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.2826329Z op = silu_mul_quant 2025-05-07T20:31:54.2826591Z if compiled: 2025-05-07T20:31:54.2826844Z op = torch.compile(op) 2025-05-07T20:31:54.2827148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2827419Z 2025-05-07T20:31:54.2827613Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.2827786Z 2025-05-07T20:31:54.2827892Z moe/activation_test.py:117: 2025-05-07T20:31:54.2828188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2828534Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.2828820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2829394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.2829973Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.2830653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.2831364Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.2831912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.2832617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.2833310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.2833863Z kernel = self.compile( 2025-05-07T20:31:54.2834500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.2835185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2835597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2835837Z 2025-05-07T20:31:54.2836052Z self = 2025-05-07T20:31:54.2837177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.2838629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44764ef380>} 2025-05-07T20:31:54.2840044Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.2841111Z context = 2025-05-07T20:31:54.2841409Z 2025-05-07T20:31:54.2841581Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.2842120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2842604Z module_map=module_map) 2025-05-07T20:31:54.2842971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2843337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2843622Z E ^ 2025-05-07T20:31:54.2844226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2844695Z 2025-05-07T20:31:54.2845130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.2845672Z 2025-05-07T20:31:54.2845779Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.2846207Z self=, 2025-05-07T20:31:54.2846624Z T=2048, 2025-05-07T20:31:54.2846845Z D=7168, 2025-05-07T20:31:54.2856842Z scale_ub=1200.0, 2025-05-07T20:31:54.2857133Z contiguous=False, 2025-05-07T20:31:54.2857371Z compiled=True, 2025-05-07T20:31:54.2857578Z ) 2025-05-07T20:31:54.2857911Z self = 2025-05-07T20:31:54.2858429Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:54.2858729Z 2025-05-07T20:31:54.2858811Z @given( 2025-05-07T20:31:54.2859052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.2859371Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.2859682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.2860022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.2860361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.2860694Z ) 2025-05-07T20:31:54.2861055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.2861510Z def test_silu_mul_quant( 2025-05-07T20:31:54.2861761Z self, 2025-05-07T20:31:54.2861952Z T: int, 2025-05-07T20:31:54.2862155Z D: int, 2025-05-07T20:31:54.2862377Z scale_ub: Optional[float], 2025-05-07T20:31:54.2862648Z contiguous: bool, 2025-05-07T20:31:54.2862896Z compiled: bool, 2025-05-07T20:31:54.2863126Z ) -> None: 2025-05-07T20:31:54.2863347Z torch.manual_seed(2025) 2025-05-07T20:31:54.2863595Z 2025-05-07T20:31:54.2863875Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.2864222Z 2025-05-07T20:31:54.2864550Z x_sign = torch.sign(x) 2025-05-07T20:31:54.2864848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.2865168Z x = x_sign * x_clamp 2025-05-07T20:31:54.2865407Z x0 = x[:, :D] 2025-05-07T20:31:54.2865629Z x1 = x[:, D:] 2025-05-07T20:31:54.2865839Z 2025-05-07T20:31:54.2866023Z if contiguous: 2025-05-07T20:31:54.2866254Z x0 = x0.contiguous() 2025-05-07T20:31:54.2866513Z x1 = x1.contiguous() 2025-05-07T20:31:54.2866823Z 2025-05-07T20:31:54.2867089Z if scale_ub is not None: 2025-05-07T20:31:54.2867418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.2867750Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.2868065Z ) 2025-05-07T20:31:54.2868270Z else: 2025-05-07T20:31:54.2868478Z scale_ub_tensor = None 2025-05-07T20:31:54.2868736Z 2025-05-07T20:31:54.2868973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.2869293Z op = silu_mul_quant 2025-05-07T20:31:54.2869548Z if compiled: 2025-05-07T20:31:54.2869800Z op = torch.compile(op) 2025-05-07T20:31:54.2870102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2870375Z 2025-05-07T20:31:54.2870573Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.2870741Z 2025-05-07T20:31:54.2870850Z moe/activation_test.py:117: 2025-05-07T20:31:54.2871148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2871490Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.2871782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2872353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.2873093Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.2873773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.2874492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.2875046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.2875753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.2876442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.2876987Z kernel = self.compile( 2025-05-07T20:31:54.2877546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.2878398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2878917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2879161Z 2025-05-07T20:31:54.2879376Z self = 2025-05-07T20:31:54.2880511Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.2881947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44764efd80>} 2025-05-07T20:31:54.2883356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.2884420Z context = 2025-05-07T20:31:54.2884722Z 2025-05-07T20:31:54.2884893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.2885540Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2886028Z module_map=module_map) 2025-05-07T20:31:54.2886398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2886770Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2887038Z E ^ 2025-05-07T20:31:54.2887520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2887987Z 2025-05-07T20:31:54.2888418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.2888960Z 2025-05-07T20:31:54.4032411Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.4032906Z self=, 2025-05-07T20:31:54.4033484Z T=1, 2025-05-07T20:31:54.4033678Z D=5120, 2025-05-07T20:31:54.4033871Z scale_ub=None, 2025-05-07T20:31:54.4034092Z contiguous=False, 2025-05-07T20:31:54.4034332Z compiled=False, 2025-05-07T20:31:54.4034541Z ) 2025-05-07T20:31:54.4034877Z self = 2025-05-07T20:31:54.4035382Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:54.4035653Z 2025-05-07T20:31:54.4035735Z @given( 2025-05-07T20:31:54.4035976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.4036303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.4036615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.4036958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.4037295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.4037587Z ) 2025-05-07T20:31:54.4038284Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.4038740Z def test_silu_mul_quant( 2025-05-07T20:31:54.4038988Z self, 2025-05-07T20:31:54.4039182Z T: int, 2025-05-07T20:31:54.4039389Z D: int, 2025-05-07T20:31:54.4039616Z scale_ub: Optional[float], 2025-05-07T20:31:54.4039887Z contiguous: bool, 2025-05-07T20:31:54.4040133Z compiled: bool, 2025-05-07T20:31:54.4040367Z ) -> None: 2025-05-07T20:31:54.4040582Z torch.manual_seed(2025) 2025-05-07T20:31:54.4040831Z 2025-05-07T20:31:54.4041110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.4041455Z 2025-05-07T20:31:54.4041655Z x_sign = torch.sign(x) 2025-05-07T20:31:54.4041953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.4042265Z x = x_sign * x_clamp 2025-05-07T20:31:54.4042515Z x0 = x[:, :D] 2025-05-07T20:31:54.4042749Z x1 = x[:, D:] 2025-05-07T20:31:54.4042962Z 2025-05-07T20:31:54.4043148Z if contiguous: 2025-05-07T20:31:54.4043386Z x0 = x0.contiguous() 2025-05-07T20:31:54.4043650Z x1 = x1.contiguous() 2025-05-07T20:31:54.4043896Z 2025-05-07T20:31:54.4044092Z if scale_ub is not None: 2025-05-07T20:31:54.4044371Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.4044709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.4045030Z ) 2025-05-07T20:31:54.4045230Z else: 2025-05-07T20:31:54.4045443Z scale_ub_tensor = None 2025-05-07T20:31:54.4045701Z 2025-05-07T20:31:54.4045942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.4046263Z op = silu_mul_quant 2025-05-07T20:31:54.4046522Z if compiled: 2025-05-07T20:31:54.4046775Z op = torch.compile(op) 2025-05-07T20:31:54.4047073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4047363Z 2025-05-07T20:31:54.4047569Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.4047737Z 2025-05-07T20:31:54.4047846Z moe/activation_test.py:117: 2025-05-07T20:31:54.4048324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4048671Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.4048966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4049674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.4050388Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.4050943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.4051649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.4052426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.4052986Z kernel = self.compile( 2025-05-07T20:31:54.4053547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.4054227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.4054642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4054885Z 2025-05-07T20:31:54.4055096Z self = 2025-05-07T20:31:54.4056215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.4057648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476edf4c0>} 2025-05-07T20:31:54.4059138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.4060208Z context = 2025-05-07T20:31:54.4060503Z 2025-05-07T20:31:54.4060678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.4061220Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.4061697Z module_map=module_map) 2025-05-07T20:31:54.4062069Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.4062435Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.4062698Z E ^ 2025-05-07T20:31:54.4063182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.4063653Z 2025-05-07T20:31:54.4064089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.4064621Z 2025-05-07T20:31:54.4064736Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.4065159Z self=, 2025-05-07T20:31:54.4065574Z T=4096, 2025-05-07T20:31:54.4065769Z D=7168, 2025-05-07T20:31:54.4065960Z scale_ub=1200.0, 2025-05-07T20:31:54.4066193Z contiguous=False, 2025-05-07T20:31:54.4066426Z compiled=False, 2025-05-07T20:31:54.4066630Z ) 2025-05-07T20:31:54.4066960Z self = 2025-05-07T20:31:54.4067482Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:54.4067768Z 2025-05-07T20:31:54.4067852Z @given( 2025-05-07T20:31:54.4068083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.4068413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.4068729Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.4069063Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.4069485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.4069780Z ) 2025-05-07T20:31:54.4070133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.4070621Z def test_silu_mul_quant( 2025-05-07T20:31:54.4070884Z self, 2025-05-07T20:31:54.4071077Z T: int, 2025-05-07T20:31:54.4071279Z D: int, 2025-05-07T20:31:54.4071518Z scale_ub: Optional[float], 2025-05-07T20:31:54.4071796Z contiguous: bool, 2025-05-07T20:31:54.4072036Z compiled: bool, 2025-05-07T20:31:54.4072265Z ) -> None: 2025-05-07T20:31:54.4072486Z torch.manual_seed(2025) 2025-05-07T20:31:54.4072728Z 2025-05-07T20:31:54.4073006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.4073364Z 2025-05-07T20:31:54.4073563Z x_sign = torch.sign(x) 2025-05-07T20:31:54.4073856Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.4074181Z x = x_sign * x_clamp 2025-05-07T20:31:54.4074425Z x0 = x[:, :D] 2025-05-07T20:31:54.4074640Z x1 = x[:, D:] 2025-05-07T20:31:54.4074852Z 2025-05-07T20:31:54.4075042Z if contiguous: 2025-05-07T20:31:54.4075271Z x0 = x0.contiguous() 2025-05-07T20:31:54.4075535Z x1 = x1.contiguous() 2025-05-07T20:31:54.4075781Z 2025-05-07T20:31:54.4075987Z if scale_ub is not None: 2025-05-07T20:31:54.4076269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.4076610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.4076924Z ) 2025-05-07T20:31:54.4077113Z else: 2025-05-07T20:31:54.4077324Z scale_ub_tensor = None 2025-05-07T20:31:54.4077694Z 2025-05-07T20:31:54.4077927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.4078250Z op = silu_mul_quant 2025-05-07T20:31:54.4078506Z if compiled: 2025-05-07T20:31:54.4078756Z op = torch.compile(op) 2025-05-07T20:31:54.4079060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4079341Z 2025-05-07T20:31:54.4079528Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.4079699Z 2025-05-07T20:31:54.4079798Z moe/activation_test.py:117: 2025-05-07T20:31:54.4080104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4080453Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.4080777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4081499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:54.4082211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.4082765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.4083468Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.4084162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.4084717Z kernel = self.compile( 2025-05-07T20:31:54.4085267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.4085948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.4086361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4086599Z 2025-05-07T20:31:54.4086809Z self = 2025-05-07T20:31:54.4087925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.4089440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476b58540>} 2025-05-07T20:31:54.4090892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.4092049Z context = 2025-05-07T20:31:54.4092346Z 2025-05-07T20:31:54.4092517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.4093061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.4093546Z module_map=module_map) 2025-05-07T20:31:54.4093926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.4094281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.4094549Z E ^ 2025-05-07T20:31:54.4095035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.4095501Z 2025-05-07T20:31:54.4095933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.4096472Z 2025-05-07T20:31:54.4096576Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.4097005Z self=, 2025-05-07T20:31:54.4097428Z T=16384, 2025-05-07T20:31:54.4097620Z D=7168, 2025-05-07T20:31:54.4097820Z scale_ub=None, 2025-05-07T20:31:54.4098040Z contiguous=True, 2025-05-07T20:31:54.4098262Z compiled=True, 2025-05-07T20:31:54.4098471Z ) 2025-05-07T20:31:54.5863186Z self = 2025-05-07T20:31:54.5863992Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:54.5864385Z 2025-05-07T20:31:54.5864493Z @given( 2025-05-07T20:31:54.5864757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.5865087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.5865405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.5865739Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.5866079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.5866369Z ) 2025-05-07T20:31:54.5866724Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.5867190Z def test_silu_mul_quant( 2025-05-07T20:31:54.5867444Z self, 2025-05-07T20:31:54.5867640Z T: int, 2025-05-07T20:31:54.5867851Z D: int, 2025-05-07T20:31:54.5868072Z scale_ub: Optional[float], 2025-05-07T20:31:54.5868353Z contiguous: bool, 2025-05-07T20:31:54.5868600Z compiled: bool, 2025-05-07T20:31:54.5868834Z ) -> None: 2025-05-07T20:31:54.5869047Z torch.manual_seed(2025) 2025-05-07T20:31:54.5869310Z 2025-05-07T20:31:54.5869595Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.5869949Z 2025-05-07T20:31:54.5870143Z x_sign = torch.sign(x) 2025-05-07T20:31:54.5870450Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.5870773Z x = x_sign * x_clamp 2025-05-07T20:31:54.5871014Z x0 = x[:, :D] 2025-05-07T20:31:54.5871239Z x1 = x[:, D:] 2025-05-07T20:31:54.5871456Z 2025-05-07T20:31:54.5871640Z if contiguous: 2025-05-07T20:31:54.5871911Z x0 = x0.contiguous() 2025-05-07T20:31:54.5872265Z x1 = x1.contiguous() 2025-05-07T20:31:54.5872508Z 2025-05-07T20:31:54.5872706Z if scale_ub is not None: 2025-05-07T20:31:54.5872997Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.5873338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.5873658Z ) 2025-05-07T20:31:54.5873860Z else: 2025-05-07T20:31:54.5874444Z scale_ub_tensor = None 2025-05-07T20:31:54.5874716Z 2025-05-07T20:31:54.5874964Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.5875310Z op = silu_mul_quant 2025-05-07T20:31:54.5875569Z if compiled: 2025-05-07T20:31:54.5875826Z op = torch.compile(op) 2025-05-07T20:31:54.5876132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.5876409Z 2025-05-07T20:31:54.5876613Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.5876784Z 2025-05-07T20:31:54.5876898Z moe/activation_test.py:117: 2025-05-07T20:31:54.5877201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.5877553Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.5877854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.5878435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.5879035Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.5879729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:54.5880451Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:54.5881011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:54.5881724Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:54.5882420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:54.5883136Z     kernel = self.compile(
2025-05-07T20:31:54.5883787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:54.5884667Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:54.5885089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:54.5885331Z 
2025-05-07T20:31:54.5885546Z self = <...>
2025-05-07T20:31:54.5886679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:54.5888138Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f4477b899e0>}
2025-05-07T20:31:54.5889544Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:54.5890620Z context = <...>
2025-05-07T20:31:54.5890918Z 
2025-05-07T20:31:54.5891095Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:54.5891646Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:54.5892238Z                            module_map=module_map)
2025-05-07T20:31:54.5892616Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:54.5892976Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:54.5893245Z E       ^
2025-05-07T20:31:54.5893732Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:54.5894202Z 
2025-05-07T20:31:54.5894633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:54.5895180Z 
2025-05-07T20:31:54.5895286Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:54.5895801Z     self=<...>,
2025-05-07T20:31:54.5896224Z     T=4096,
2025-05-07T20:31:54.5896412Z     D=5120,
2025-05-07T20:31:54.5896610Z     scale_ub=None,
2025-05-07T20:31:54.5896835Z     contiguous=False,
2025-05-07T20:31:54.5897063Z     compiled=True,
2025-05-07T20:31:54.5897279Z )
2025-05-07T20:31:54.5897613Z self = <...>
2025-05-07T20:31:54.5898128Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:31:54.5898420Z 
2025-05-07T20:31:54.5898501Z     @given(
2025-05-07T20:31:54.5898743Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:54.5899065Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:54.5899383Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:54.5899733Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:54.5900079Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:54.5900377Z     )
2025-05-07T20:31:54.5900773Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:54.5901260Z     def test_silu_mul_quant(
2025-05-07T20:31:54.5901506Z         self,
2025-05-07T20:31:54.5901716Z         T: int,
2025-05-07T20:31:54.5901927Z         D: int,
2025-05-07T20:31:54.5902146Z         scale_ub: Optional[float],
2025-05-07T20:31:54.5902430Z         contiguous: bool,
2025-05-07T20:31:54.5902680Z         compiled: bool,
2025-05-07T20:31:54.5902910Z     ) -> None:
2025-05-07T20:31:54.5903139Z         torch.manual_seed(2025)
2025-05-07T20:31:54.5903394Z 
2025-05-07T20:31:54.5903671Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:54.5904029Z 
2025-05-07T20:31:54.5904228Z         x_sign = torch.sign(x)
2025-05-07T20:31:54.5904618Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:54.5904936Z         x = x_sign * x_clamp
2025-05-07T20:31:54.5905185Z         x0 = x[:, :D]
2025-05-07T20:31:54.5905413Z         x1 = x[:, D:]
2025-05-07T20:31:54.5905625Z 
2025-05-07T20:31:54.5905835Z         if contiguous:
2025-05-07T20:31:54.5906076Z             x0 = x0.contiguous()
2025-05-07T20:31:54.5906696Z             x1 = x1.contiguous()
2025-05-07T20:31:54.5906947Z 
2025-05-07T20:31:54.5907147Z         if scale_ub is not None:
2025-05-07T20:31:54.5907431Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:54.5907772Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:54.5916728Z             )
2025-05-07T20:31:54.5916950Z         else:
2025-05-07T20:31:54.5917167Z             scale_ub_tensor = None
2025-05-07T20:31:54.5917426Z 
2025-05-07T20:31:54.5917661Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:54.5917987Z             op = silu_mul_quant
2025-05-07T20:31:54.5918269Z             if compiled:
2025-05-07T20:31:54.5918522Z                 op = torch.compile(op)
2025-05-07T20:31:54.5918832Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:54.5919122Z 
2025-05-07T20:31:54.5919333Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:54.5919505Z 
2025-05-07T20:31:54.5919611Z moe/activation_test.py:117: 
2025-05-07T20:31:54.5919928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:54.5920284Z moe/activation_test.py:115: in fn
2025-05-07T20:31:54.5920575Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:54.5921163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:54.5921749Z     return fn(*args, **kwargs)
2025-05-07T20:31:54.5922431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:54.5923157Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:54.5923724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:54.5924617Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:54.5925315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:54.5925867Z     kernel = self.compile(
2025-05-07T20:31:54.5926438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:54.5927116Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:54.5927532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:54.5927771Z 
2025-05-07T20:31:54.5927990Z self = <...>
2025-05-07T20:31:54.5929127Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:54.5930565Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f4477b5fba0>}
2025-05-07T20:31:54.5932048Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:54.5933121Z context = <...>
2025-05-07T20:31:54.5933420Z 
2025-05-07T20:31:54.5933600Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:54.5934142Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:54.5934772Z                            module_map=module_map)
2025-05-07T20:31:54.5935150Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:54.5935519Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:54.5935790Z E       ^
2025-05-07T20:31:54.5936275Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:54.5936743Z 
2025-05-07T20:31:54.5937181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:54.5937714Z 
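Every example in this run fails at the same point: the Triton kernel _fbgemm_silu_mul_quant invoked by silu_mul_quant requests the fp8e4nv (FP8 E4M3) dtype, and Triton lowers that dtype only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts such as an A10G (SM 8.6), Triton exposes only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of a capability gate a test like this could use to skip cleanly on such hardware follows; the helper and class names are illustrative, not taken from activation_test.py:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (E4M3) only on NVIDIA GPUs with
        # compute capability >= (8, 9); pre-Ada parts like the A10G report
        # (8, 6) and raise the CompilationError seen above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantFp8Test(unittest.TestCase):
        # test_silu_mul_quant from the listing above would live here.
        pass

For reference, the failure should reproduce without Hypothesis or torch.compile, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([128, 2 * D], device="cuda", dtype=torch.bfloat16)
    # On SM 8.6 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)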
[The following Hypothesis examples each fail with the identical CompilationError; per-example source listings and tracebacks elided.]
2025-05-07T20:31:54.7390966Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:54.7424023Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:54.8593287Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:54.8626687Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -- same CompilationError
2025-05-07T20:31:54.8658718Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:55.1101170Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:55.1133631Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:55.2043837Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:55.3661203Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:55.3701780Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- same CompilationError
2025-05-07T20:31:55.5416890Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:55.5417363Z     self=<...>,
2025-05-07T20:31:55.5417957Z     T=16384,
2025-05-07T20:31:55.5418162Z     D=5120,
2025-05-07T20:31:55.5418368Z     scale_ub=None,
2025-05-07T20:31:55.5418595Z     contiguous=False,
2025-05-07T20:31:55.5418831Z     compiled=True,
2025-05-07T20:31:55.5419047Z )
2025-05-07T20:31:55.5419377Z self = <...>
2025-05-07T20:31:55.5419903Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True
[test source identical to the listing above; elided]
2025-05-07T20:31:55.5432255Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:55.5432423Z 
2025-05-07T20:31:55.5432536Z moe/activation_test.py:117: 
2025-05-07T20:31:55.5432844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:55.5433182Z moe/activation_test.py:115: in fn
2025-05-07T20:31:55.5433474Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:55.5434054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:55.5434631Z     return fn(*args, **kwargs)
2025-05-07T20:31:55.5435317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.5436042Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.5436602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.5437313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.5438010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.5438570Z kernel = self.compile( 2025-05-07T20:31:55.5439131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.5439817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.5440234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.5440473Z 2025-05-07T20:31:55.5440691Z self = 2025-05-07T20:31:55.5441868Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.5443421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608f920>} 2025-05-07T20:31:55.5444823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.5445893Z context = 2025-05-07T20:31:55.5446189Z 2025-05-07T20:31:55.5446367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.5446910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.5447398Z module_map=module_map) 2025-05-07T20:31:55.5447780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.5448139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.5448410Z E ^ 2025-05-07T20:31:55.5448896Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.5449365Z 2025-05-07T20:31:55.5449801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.5450334Z 2025-05-07T20:31:55.5450439Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.5450869Z self=, 2025-05-07T20:31:55.5451287Z T=2048, 2025-05-07T20:31:55.5451473Z D=5120, 2025-05-07T20:31:55.5451669Z scale_ub=None, 2025-05-07T20:31:55.5451965Z contiguous=False, 2025-05-07T20:31:55.5452191Z compiled=True, 2025-05-07T20:31:55.5452400Z ) 2025-05-07T20:31:55.8370764Z self = 2025-05-07T20:31:55.8372460Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:55.8372753Z 2025-05-07T20:31:55.8372831Z @given( 2025-05-07T20:31:55.8373078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.8373393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.8373706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.8374050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.8374380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.8374671Z ) 2025-05-07T20:31:55.8375034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.8375488Z def test_silu_mul_quant( 2025-05-07T20:31:55.8375744Z self, 2025-05-07T20:31:55.8375949Z T: int, 2025-05-07T20:31:55.8376148Z D: int, 2025-05-07T20:31:55.8376373Z scale_ub: Optional[float], 2025-05-07T20:31:55.8376664Z contiguous: bool, 2025-05-07T20:31:55.8376913Z compiled: bool, 2025-05-07T20:31:55.8377140Z ) -> None: 2025-05-07T20:31:55.8377367Z torch.manual_seed(2025) 2025-05-07T20:31:55.8377623Z 2025-05-07T20:31:55.8377904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.8378260Z 2025-05-07T20:31:55.8378463Z x_sign = torch.sign(x) 2025-05-07T20:31:55.8378756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.8379081Z x = x_sign * x_clamp 2025-05-07T20:31:55.8379329Z x0 = x[:, :D] 2025-05-07T20:31:55.8379546Z x1 = x[:, D:] 2025-05-07T20:31:55.8379756Z 2025-05-07T20:31:55.8379946Z if contiguous: 2025-05-07T20:31:55.8380177Z x0 = x0.contiguous() 2025-05-07T20:31:55.8380453Z x1 = x1.contiguous() 2025-05-07T20:31:55.8380699Z 2025-05-07T20:31:55.8380890Z if scale_ub is not None: 2025-05-07T20:31:55.8381168Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.8381522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.8381844Z ) 2025-05-07T20:31:55.8382038Z else: 2025-05-07T20:31:55.8382388Z scale_ub_tensor = None 2025-05-07T20:31:55.8382653Z 2025-05-07T20:31:55.8382887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.8383216Z op = silu_mul_quant 2025-05-07T20:31:55.8383473Z if compiled: 2025-05-07T20:31:55.8383719Z op = torch.compile(op) 2025-05-07T20:31:55.8384022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8384304Z 2025-05-07T20:31:55.8384496Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.8384670Z 2025-05-07T20:31:55.8384773Z moe/activation_test.py:117: 2025-05-07T20:31:55.8385082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8385412Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.8385702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8386277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:55.8386851Z return fn(*args, **kwargs) 
2025-05-07T20:31:55.8387527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.8388232Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.8388784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.8389484Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.8390161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.8390707Z kernel = self.compile( 2025-05-07T20:31:55.8391260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.8392019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.8392426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8392674Z 2025-05-07T20:31:55.8392886Z self = 2025-05-07T20:31:55.8394001Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.8395420Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608df80>} 2025-05-07T20:31:55.8396815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.8397880Z context = 2025-05-07T20:31:55.8398174Z 2025-05-07T20:31:55.8398356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.8398896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.8399370Z module_map=module_map) 2025-05-07T20:31:55.8399743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.8400108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.8400363Z E ^ 2025-05-07T20:31:55.8400839Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.8401303Z 2025-05-07T20:31:55.8401740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.8402275Z 2025-05-07T20:31:55.8402385Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.8402805Z self=, 2025-05-07T20:31:55.8403302Z T=2048, 2025-05-07T20:31:55.8403496Z D=5120, 2025-05-07T20:31:55.8403685Z scale_ub=1200.0, 2025-05-07T20:31:55.8403914Z contiguous=False, 2025-05-07T20:31:55.8404141Z compiled=True, 2025-05-07T20:31:55.8404343Z ) 2025-05-07T20:31:55.8404669Z self = 2025-05-07T20:31:55.8405180Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:55.8405461Z 2025-05-07T20:31:55.8405545Z @given( 2025-05-07T20:31:55.8405772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.8406091Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.8406659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.8406999Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.8407335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.8407624Z ) 2025-05-07T20:31:55.8407984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.8408437Z def test_silu_mul_quant( 2025-05-07T20:31:55.8408683Z self, 2025-05-07T20:31:55.8408874Z T: int, 2025-05-07T20:31:55.8409073Z D: int, 2025-05-07T20:31:55.8409295Z scale_ub: Optional[float], 2025-05-07T20:31:55.8409564Z contiguous: bool, 2025-05-07T20:31:55.8409806Z compiled: bool, 2025-05-07T20:31:55.8410033Z ) -> None: 2025-05-07T20:31:55.8410246Z torch.manual_seed(2025) 2025-05-07T20:31:55.8410491Z 2025-05-07T20:31:55.8410768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.8411118Z 2025-05-07T20:31:55.8411310Z x_sign = torch.sign(x) 2025-05-07T20:31:55.8411602Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.8412110Z x = x_sign * x_clamp 2025-05-07T20:31:55.8412348Z x0 = x[:, :D] 2025-05-07T20:31:55.8412567Z x1 = x[:, D:] 2025-05-07T20:31:55.8412778Z 2025-05-07T20:31:55.8412961Z if contiguous: 2025-05-07T20:31:55.8413194Z x0 = x0.contiguous() 2025-05-07T20:31:55.8413453Z x1 = x1.contiguous() 2025-05-07T20:31:55.8413687Z 2025-05-07T20:31:55.8413880Z if scale_ub is not None: 2025-05-07T20:31:55.8414156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.8414489Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.8414803Z ) 2025-05-07T20:31:55.8414995Z else: 2025-05-07T20:31:55.8415202Z scale_ub_tensor = None 2025-05-07T20:31:55.8415457Z 2025-05-07T20:31:55.8415691Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.8416010Z op = silu_mul_quant 2025-05-07T20:31:55.8416265Z if compiled: 2025-05-07T20:31:55.8416518Z op = torch.compile(op) 2025-05-07T20:31:55.8416817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8417089Z 2025-05-07T20:31:55.8417287Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.8417454Z 2025-05-07T20:31:55.8417560Z moe/activation_test.py:117: 2025-05-07T20:31:55.8417858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8418196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.8418485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8419055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:55.8419629Z return fn(*args, **kwargs) 
2025-05-07T20:31:55.8420306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.8421011Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.8421561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.8422382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.8423072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.8423620Z kernel = self.compile( 2025-05-07T20:31:55.8424170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.8424845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.8425251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8425486Z 2025-05-07T20:31:55.8425699Z self = 2025-05-07T20:31:55.8426815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.8428249Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fdca40>} 2025-05-07T20:31:55.8429637Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.8430696Z context = 2025-05-07T20:31:55.8431014Z 2025-05-07T20:31:55.8431209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.8431745Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.8432333Z module_map=module_map) 2025-05-07T20:31:55.8432706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.8433062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.8433328Z E ^ 2025-05-07T20:31:55.8433812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.8434278Z 2025-05-07T20:31:55.8434709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.8435243Z 2025-05-07T20:31:56.0157367Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.0157999Z self=, 2025-05-07T20:31:56.0158611Z T=4096, 2025-05-07T20:31:56.0158863Z D=5120, 2025-05-07T20:31:56.0159119Z scale_ub=1200.0, 2025-05-07T20:31:56.0159348Z contiguous=True, 2025-05-07T20:31:56.0159571Z compiled=True, 2025-05-07T20:31:56.0159804Z ) 2025-05-07T20:31:56.0160147Z self = 2025-05-07T20:31:56.0160659Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:56.0160955Z 2025-05-07T20:31:56.0161047Z @given( 2025-05-07T20:31:56.0161330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.0161664Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.0161981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.0162330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.0162673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.0162959Z ) 2025-05-07T20:31:56.0163328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.0163784Z def test_silu_mul_quant( 2025-05-07T20:31:56.0164041Z self, 2025-05-07T20:31:56.0164246Z T: int, 2025-05-07T20:31:56.0164444Z D: int, 2025-05-07T20:31:56.0164676Z scale_ub: Optional[float], 2025-05-07T20:31:56.0164962Z contiguous: bool, 2025-05-07T20:31:56.0165205Z compiled: bool, 2025-05-07T20:31:56.0165440Z ) -> None: 2025-05-07T20:31:56.0165835Z torch.manual_seed(2025) 2025-05-07T20:31:56.0166091Z 2025-05-07T20:31:56.0166370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.0166725Z 2025-05-07T20:31:56.0166927Z x_sign = torch.sign(x) 2025-05-07T20:31:56.0167226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.0167550Z x = x_sign * x_clamp 2025-05-07T20:31:56.0167804Z x0 = x[:, :D] 2025-05-07T20:31:56.0168026Z x1 = x[:, D:] 2025-05-07T20:31:56.0168244Z 2025-05-07T20:31:56.0168442Z if contiguous: 2025-05-07T20:31:56.0168681Z x0 = x0.contiguous() 2025-05-07T20:31:56.0168957Z x1 = x1.contiguous() 2025-05-07T20:31:56.0169215Z 2025-05-07T20:31:56.0169414Z if scale_ub is not None: 2025-05-07T20:31:56.0169701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.0170048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.0170357Z ) 2025-05-07T20:31:56.0170569Z else: 2025-05-07T20:31:56.0170789Z scale_ub_tensor = None 2025-05-07T20:31:56.0171050Z 2025-05-07T20:31:56.0171289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.0171623Z op = silu_mul_quant 2025-05-07T20:31:56.0171983Z if compiled: 2025-05-07T20:31:56.0172229Z op = torch.compile(op) 2025-05-07T20:31:56.0172530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.0172809Z 2025-05-07T20:31:56.0173000Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.0173170Z 2025-05-07T20:31:56.0173272Z moe/activation_test.py:117: 2025-05-07T20:31:56.0173576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.0174051Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.0174339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.0174921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.0175500Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.0176174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.0176885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.0177437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.0178133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.0178818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.0179367Z kernel = self.compile( 2025-05-07T20:31:56.0179933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.0180603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.0181056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.0181308Z 2025-05-07T20:31:56.0181531Z self = 2025-05-07T20:31:56.0182650Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.0184070Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fde2a0>} 2025-05-07T20:31:56.0185466Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.0186611Z context = 2025-05-07T20:31:56.0186909Z 2025-05-07T20:31:56.0187090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.0187626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.0188108Z module_map=module_map) 2025-05-07T20:31:56.0188485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.0188852Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.0189114Z E ^ 2025-05-07T20:31:56.0189599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.0190063Z 2025-05-07T20:31:56.0190500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.0191052Z 2025-05-07T20:31:56.0191177Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.0191629Z self=, 2025-05-07T20:31:56.0192047Z T=128, 2025-05-07T20:31:56.0192240Z D=5120, 2025-05-07T20:31:56.0192437Z scale_ub=1200.0, 2025-05-07T20:31:56.0192667Z contiguous=False, 2025-05-07T20:31:56.0192900Z compiled=True, 2025-05-07T20:31:56.0193104Z ) 2025-05-07T20:31:56.1197621Z self = 2025-05-07T20:31:56.1199143Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:56.1199917Z 2025-05-07T20:31:56.1200133Z @given( 2025-05-07T20:31:56.1200650Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1201204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1201767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1202109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1202449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1202741Z ) 2025-05-07T20:31:56.1203115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1203575Z def test_silu_mul_quant( 2025-05-07T20:31:56.1203823Z self, 2025-05-07T20:31:56.1204031Z T: int, 2025-05-07T20:31:56.1204245Z D: int, 2025-05-07T20:31:56.1204468Z scale_ub: Optional[float], 2025-05-07T20:31:56.1204754Z contiguous: bool, 2025-05-07T20:31:56.1205006Z compiled: bool, 2025-05-07T20:31:56.1205234Z ) -> None: 2025-05-07T20:31:56.1205458Z torch.manual_seed(2025) 2025-05-07T20:31:56.1205707Z 2025-05-07T20:31:56.1205981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1206513Z 2025-05-07T20:31:56.1206723Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1207016Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1207339Z x = x_sign * x_clamp 2025-05-07T20:31:56.1207588Z x0 = x[:, :D] 2025-05-07T20:31:56.1207818Z x1 = x[:, D:] 2025-05-07T20:31:56.1208023Z 2025-05-07T20:31:56.1208216Z if contiguous: 2025-05-07T20:31:56.1208454Z x0 = x0.contiguous() 2025-05-07T20:31:56.1208714Z x1 = x1.contiguous() 2025-05-07T20:31:56.1208958Z 2025-05-07T20:31:56.1209159Z if scale_ub is not None: 2025-05-07T20:31:56.1209434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1209775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1210092Z ) 2025-05-07T20:31:56.1210283Z else: 2025-05-07T20:31:56.1210502Z scale_ub_tensor = None 2025-05-07T20:31:56.1210758Z 2025-05-07T20:31:56.1210991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1211319Z op = silu_mul_quant 2025-05-07T20:31:56.1211583Z if compiled: 2025-05-07T20:31:56.1211904Z op = torch.compile(op) 2025-05-07T20:31:56.1212209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1212621Z 2025-05-07T20:31:56.1212824Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.1212991Z 2025-05-07T20:31:56.1213098Z moe/activation_test.py:117: 2025-05-07T20:31:56.1213403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1213752Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.1214034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1214615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.1215193Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.1215871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.1216590Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.1217138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1217843Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1218520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.1219066Z kernel = self.compile( 2025-05-07T20:31:56.1219619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1220296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1220697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1220938Z 2025-05-07T20:31:56.1221173Z self = 2025-05-07T20:31:56.1222474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1223892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44779540e0>} 2025-05-07T20:31:56.1225272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1226329Z context = 2025-05-07T20:31:56.1226629Z 2025-05-07T20:31:56.1226799Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1227335Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1227816Z module_map=module_map) 2025-05-07T20:31:56.1228187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1228548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.1228809Z E ^ 2025-05-07T20:31:56.1229281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1229755Z 2025-05-07T20:31:56.1230183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1230712Z 2025-05-07T20:31:56.1230822Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.1231241Z self=, 2025-05-07T20:31:56.1231655Z T=16384, 2025-05-07T20:31:56.1231856Z D=7168, 2025-05-07T20:31:56.1232053Z scale_ub=1200.0, 2025-05-07T20:31:56.1232274Z contiguous=True, 2025-05-07T20:31:56.1232503Z compiled=True, 2025-05-07T20:31:56.1232708Z ) 2025-05-07T20:31:56.1233034Z self = 2025-05-07T20:31:56.1233629Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:56.1233917Z 2025-05-07T20:31:56.1234000Z @given( 2025-05-07T20:31:56.1234231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1234553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1234863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1235193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1235528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1235815Z ) 2025-05-07T20:31:56.1236176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1236623Z def test_silu_mul_quant( 2025-05-07T20:31:56.1236870Z self, 2025-05-07T20:31:56.1237074Z T: int, 2025-05-07T20:31:56.1237277Z D: int, 2025-05-07T20:31:56.1237495Z scale_ub: Optional[float], 2025-05-07T20:31:56.1237770Z contiguous: bool, 2025-05-07T20:31:56.1238015Z compiled: bool, 2025-05-07T20:31:56.1238248Z ) -> None: 2025-05-07T20:31:56.1238469Z torch.manual_seed(2025) 2025-05-07T20:31:56.1238721Z 2025-05-07T20:31:56.1238994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1239345Z 2025-05-07T20:31:56.1239543Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1239833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1240148Z x = x_sign * x_clamp 2025-05-07T20:31:56.1240391Z x0 = x[:, :D] 2025-05-07T20:31:56.1240608Z x1 = x[:, D:] 2025-05-07T20:31:56.1240817Z 2025-05-07T20:31:56.1241004Z if contiguous: 2025-05-07T20:31:56.1241232Z x0 = x0.contiguous() 2025-05-07T20:31:56.1241494Z x1 = x1.contiguous() 2025-05-07T20:31:56.1241852Z 2025-05-07T20:31:56.1242042Z if scale_ub is not None: 2025-05-07T20:31:56.1242316Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1242661Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1242976Z ) 2025-05-07T20:31:56.1243168Z else: 2025-05-07T20:31:56.1243380Z scale_ub_tensor = None 2025-05-07T20:31:56.1243634Z 2025-05-07T20:31:56.1243861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1244183Z op = silu_mul_quant 2025-05-07T20:31:56.1244439Z if compiled: 2025-05-07T20:31:56.1244686Z op = torch.compile(op) 2025-05-07T20:31:56.1244986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1245267Z 2025-05-07T20:31:56.1245456Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.1245626Z 2025-05-07T20:31:56.1245726Z moe/activation_test.py:117: 2025-05-07T20:31:56.1246028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1246372Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.1246654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1247232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.1247811Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.1248482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.1249189Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.1249742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1250446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1251130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.1251680Z kernel = self.compile( 2025-05-07T20:31:56.1252312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1253065Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1253474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1253716Z 2025-05-07T20:31:56.1253927Z self = 2025-05-07T20:31:56.1255036Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1256452Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477956160>} 2025-05-07T20:31:56.1257845Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1258910Z context = 2025-05-07T20:31:56.1259206Z 2025-05-07T20:31:56.1259380Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1259914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1260388Z module_map=module_map) 2025-05-07T20:31:56.1260760Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1261122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.1261381Z E ^ 2025-05-07T20:31:56.1261861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1262410Z 2025-05-07T20:31:56.1262849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1263377Z 2025-05-07T20:31:56.2464748Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.2465441Z self=, 2025-05-07T20:31:56.2466017Z T=16384, 2025-05-07T20:31:56.2466288Z D=5120, 2025-05-07T20:31:56.2466557Z scale_ub=1200.0, 2025-05-07T20:31:56.2466802Z contiguous=True, 2025-05-07T20:31:56.2467024Z compiled=False, 2025-05-07T20:31:56.2467234Z ) 2025-05-07T20:31:56.2467555Z self = 2025-05-07T20:31:56.2468069Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:56.2468365Z 2025-05-07T20:31:56.2468444Z @given( 2025-05-07T20:31:56.2468683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.2469010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.2469327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.2469669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.2470004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.2470295Z ) 2025-05-07T20:31:56.2470656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.2471114Z def test_silu_mul_quant( 2025-05-07T20:31:56.2471401Z self, 2025-05-07T20:31:56.2471602Z T: int, 2025-05-07T20:31:56.2471800Z D: int, 2025-05-07T20:31:56.2472024Z scale_ub: Optional[float], 2025-05-07T20:31:56.2472308Z contiguous: bool, 2025-05-07T20:31:56.2472556Z compiled: bool, 2025-05-07T20:31:56.2472784Z ) -> None: 2025-05-07T20:31:56.2473016Z torch.manual_seed(2025) 2025-05-07T20:31:56.2473272Z 2025-05-07T20:31:56.2473540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.2473898Z 2025-05-07T20:31:56.2474098Z x_sign = torch.sign(x) 2025-05-07T20:31:56.2474392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.2474711Z x = x_sign * x_clamp 2025-05-07T20:31:56.2475136Z x0 = x[:, :D] 2025-05-07T20:31:56.2475360Z x1 = x[:, D:] 2025-05-07T20:31:56.2475569Z 2025-05-07T20:31:56.2475762Z if contiguous: 2025-05-07T20:31:56.2475994Z x0 = x0.contiguous() 2025-05-07T20:31:56.2476263Z x1 = x1.contiguous() 2025-05-07T20:31:56.2476510Z 2025-05-07T20:31:56.2476702Z if scale_ub is not None: 2025-05-07T20:31:56.2476981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.2477324Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.2477639Z ) 2025-05-07T20:31:56.2477831Z else: 2025-05-07T20:31:56.2478049Z scale_ub_tensor = None 2025-05-07T20:31:56.2478313Z 2025-05-07T20:31:56.2478547Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.2478871Z op = silu_mul_quant 2025-05-07T20:31:56.2479129Z if compiled: 2025-05-07T20:31:56.2479373Z op = torch.compile(op) 2025-05-07T20:31:56.2479677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2479959Z 2025-05-07T20:31:56.2480151Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.2480323Z 2025-05-07T20:31:56.2480430Z moe/activation_test.py:117: 2025-05-07T20:31:56.2480734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2481066Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.2481367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2482118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:56.2482826Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.2483371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.2484208Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.2484897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.2485444Z kernel = self.compile( 2025-05-07T20:31:56.2485992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.2486671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2487078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2487314Z 2025-05-07T20:31:56.2487528Z self = 2025-05-07T20:31:56.2488652Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.2490103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477847d80>} 2025-05-07T20:31:56.2491495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.2492620Z context = 2025-05-07T20:31:56.2492916Z 2025-05-07T20:31:56.2493096Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.2493630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2494106Z module_map=module_map) 2025-05-07T20:31:56.2494489Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2494849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2495109Z E ^ 2025-05-07T20:31:56.2495678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.2496145Z 2025-05-07T20:31:56.2496584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.2497114Z 2025-05-07T20:31:56.2497217Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.2497642Z self=, 2025-05-07T20:31:56.2498061Z T=1, 2025-05-07T20:31:56.2498250Z D=7168, 2025-05-07T20:31:56.2498446Z scale_ub=1200.0, 2025-05-07T20:31:56.2498674Z contiguous=False, 2025-05-07T20:31:56.2498901Z compiled=False, 2025-05-07T20:31:56.2499105Z ) 2025-05-07T20:31:56.2499429Z self = 2025-05-07T20:31:56.2499938Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:56.2500211Z 2025-05-07T20:31:56.2500294Z @given( 2025-05-07T20:31:56.2500533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.2500851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.2501156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.2501492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.2501828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.2502120Z ) 2025-05-07T20:31:56.2502469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.2502920Z def test_silu_mul_quant( 2025-05-07T20:31:56.2503165Z self, 2025-05-07T20:31:56.2503356Z T: int, 2025-05-07T20:31:56.2503554Z D: int, 2025-05-07T20:31:56.2503776Z scale_ub: Optional[float], 2025-05-07T20:31:56.2504133Z contiguous: bool, 2025-05-07T20:31:56.2504374Z compiled: bool, 2025-05-07T20:31:56.2504601Z ) -> None: 2025-05-07T20:31:56.2504817Z torch.manual_seed(2025) 2025-05-07T20:31:56.2505059Z 2025-05-07T20:31:56.2505338Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.2505682Z 2025-05-07T20:31:56.2505877Z x_sign = torch.sign(x) 2025-05-07T20:31:56.2506349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.2506664Z x = x_sign * x_clamp 2025-05-07T20:31:56.2506905Z x0 = x[:, :D] 2025-05-07T20:31:56.2507123Z x1 = x[:, D:] 2025-05-07T20:31:56.2507329Z 2025-05-07T20:31:56.2507516Z if contiguous: 2025-05-07T20:31:56.2507755Z x0 = x0.contiguous() 2025-05-07T20:31:56.2508017Z x1 = x1.contiguous() 2025-05-07T20:31:56.2508253Z 2025-05-07T20:31:56.2508444Z if scale_ub is not None: 2025-05-07T20:31:56.2508725Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.2515457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.2515812Z ) 2025-05-07T20:31:56.2516012Z else: 2025-05-07T20:31:56.2516228Z scale_ub_tensor = None 2025-05-07T20:31:56.2516484Z 2025-05-07T20:31:56.2516727Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.2517048Z op = silu_mul_quant 2025-05-07T20:31:56.2517298Z if compiled: 2025-05-07T20:31:56.2517549Z op = torch.compile(op) 2025-05-07T20:31:56.2517848Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2518125Z 2025-05-07T20:31:56.2518315Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.2518482Z 2025-05-07T20:31:56.2518589Z moe/activation_test.py:117: 2025-05-07T20:31:56.2518884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2519217Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.2519505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2520214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.2521084Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.2521641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.2522339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.2523026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.2523576Z kernel = self.compile( 2025-05-07T20:31:56.2524127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.2524797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2525201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2525440Z 2025-05-07T20:31:56.2525655Z self = 2025-05-07T20:31:56.2526770Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.2528189Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477e7fce0>} 2025-05-07T20:31:56.2529580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.2530636Z context = 2025-05-07T20:31:56.2531073Z 2025-05-07T20:31:56.2531274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.2531881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2532366Z module_map=module_map) 2025-05-07T20:31:56.2532737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2533093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2533357Z E ^ 2025-05-07T20:31:56.2533837Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.2534305Z 2025-05-07T20:31:56.2534738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.2535266Z 2025-05-07T20:31:56.4251682Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.4252409Z self=, 2025-05-07T20:31:56.4252992Z T=4096, 2025-05-07T20:31:56.4253254Z D=7168, 2025-05-07T20:31:56.4253513Z scale_ub=1200.0, 2025-05-07T20:31:56.4253796Z contiguous=False, 2025-05-07T20:31:56.4254096Z compiled=True, 2025-05-07T20:31:56.4254367Z ) 2025-05-07T20:31:56.4254727Z self = 2025-05-07T20:31:56.4255243Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:56.4255527Z 2025-05-07T20:31:56.4255616Z @given( 2025-05-07T20:31:56.4255846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.4256166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.4256480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.4256808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.4257138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.4257434Z ) 2025-05-07T20:31:56.4257796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.4258243Z def test_silu_mul_quant( 2025-05-07T20:31:56.4258490Z self, 2025-05-07T20:31:56.4258691Z T: int, 2025-05-07T20:31:56.4259054Z D: int, 2025-05-07T20:31:56.4259280Z scale_ub: Optional[float], 2025-05-07T20:31:56.4259554Z contiguous: bool, 2025-05-07T20:31:56.4259796Z compiled: bool, 2025-05-07T20:31:56.4260029Z ) -> None: 2025-05-07T20:31:56.4260245Z torch.manual_seed(2025) 2025-05-07T20:31:56.4260486Z 2025-05-07T20:31:56.4260767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.4261114Z 2025-05-07T20:31:56.4261304Z x_sign = torch.sign(x) 2025-05-07T20:31:56.4261605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.4261928Z x = x_sign * x_clamp 2025-05-07T20:31:56.4262167Z x0 = x[:, :D] 2025-05-07T20:31:56.4262386Z x1 = x[:, D:] 2025-05-07T20:31:56.4262601Z 2025-05-07T20:31:56.4262793Z if contiguous: 2025-05-07T20:31:56.4263023Z x0 = x0.contiguous() 2025-05-07T20:31:56.4263284Z x1 = x1.contiguous() 2025-05-07T20:31:56.4263540Z 2025-05-07T20:31:56.4263730Z if scale_ub is not None: 2025-05-07T20:31:56.4264008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.4264348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.4264656Z ) 2025-05-07T20:31:56.4264850Z else: 2025-05-07T20:31:56.4265060Z scale_ub_tensor = None 2025-05-07T20:31:56.4265309Z 2025-05-07T20:31:56.4265548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.4265872Z op = silu_mul_quant 2025-05-07T20:31:56.4266121Z if compiled: 2025-05-07T20:31:56.4266372Z op = torch.compile(op) 2025-05-07T20:31:56.4266668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.4267061Z 2025-05-07T20:31:56.4267255Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.4267421Z 2025-05-07T20:31:56.4267522Z moe/activation_test.py:117: 2025-05-07T20:31:56.4267824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.4268157Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.4268436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.4269005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.4269573Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.4270244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.4270945Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.4271490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.4272181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.4272869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.4273418Z kernel = self.compile( 2025-05-07T20:31:56.4273966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.4274639Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.4275045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.4275279Z 2025-05-07T20:31:56.4275497Z self = 2025-05-07T20:31:56.4276609Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.4278034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44c6ff5f80>} 2025-05-07T20:31:56.4279515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.4280573Z context = 2025-05-07T20:31:56.4280865Z 2025-05-07T20:31:56.4281038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.4281594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.4282095Z module_map=module_map) 2025-05-07T20:31:56.4282467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.4282824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.4283093Z E ^ 2025-05-07T20:31:56.4283566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.4284029Z 2025-05-07T20:31:56.4284477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.4285005Z 2025-05-07T20:31:56.4285110Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.4285531Z self=, 2025-05-07T20:31:56.4285941Z T=128, 2025-05-07T20:31:56.4286131Z D=7168, 2025-05-07T20:31:56.4286320Z scale_ub=1200.0, 2025-05-07T20:31:56.4286548Z contiguous=False, 2025-05-07T20:31:56.4286773Z compiled=True, 2025-05-07T20:31:56.4286972Z ) 2025-05-07T20:31:56.5188193Z self = 2025-05-07T20:31:56.5188917Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:56.5189522Z 2025-05-07T20:31:56.5189628Z @given( 2025-05-07T20:31:56.5189874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.5190191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.5190520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.5190861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.5191225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.5191541Z ) 2025-05-07T20:31:56.5191903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.5192361Z def test_silu_mul_quant( 2025-05-07T20:31:56.5192610Z self, 2025-05-07T20:31:56.5192814Z T: int, 2025-05-07T20:31:56.5193020Z D: int, 2025-05-07T20:31:56.5193244Z scale_ub: Optional[float], 2025-05-07T20:31:56.5193530Z contiguous: bool, 2025-05-07T20:31:56.5193783Z compiled: bool, 2025-05-07T20:31:56.5194017Z ) -> None: 2025-05-07T20:31:56.5194242Z torch.manual_seed(2025) 2025-05-07T20:31:56.5194492Z 2025-05-07T20:31:56.5194766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.5195122Z 2025-05-07T20:31:56.5195329Z x_sign = torch.sign(x) 2025-05-07T20:31:56.5195623Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.5195943Z x = x_sign * x_clamp 2025-05-07T20:31:56.5196191Z x0 = x[:, :D] 2025-05-07T20:31:56.5196415Z x1 = x[:, D:] 2025-05-07T20:31:56.5196628Z 2025-05-07T20:31:56.5196824Z if contiguous: 2025-05-07T20:31:56.5197066Z x0 = x0.contiguous() 2025-05-07T20:31:56.5197331Z x1 = x1.contiguous() 2025-05-07T20:31:56.5197582Z 2025-05-07T20:31:56.5197778Z if scale_ub is not None: 2025-05-07T20:31:56.5198055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.5198404Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.5198730Z ) 2025-05-07T20:31:56.5198926Z else: 2025-05-07T20:31:56.5199149Z scale_ub_tensor = None 2025-05-07T20:31:56.5199411Z 2025-05-07T20:31:56.5199780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.5200109Z op = silu_mul_quant 2025-05-07T20:31:56.5200366Z if compiled: 2025-05-07T20:31:56.5200620Z op = torch.compile(op) 2025-05-07T20:31:56.5200931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5201211Z 2025-05-07T20:31:56.5201415Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.5201583Z 2025-05-07T20:31:56.5201689Z moe/activation_test.py:117: 2025-05-07T20:31:56.5201993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.5202338Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.5202630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5203212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.5203794Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.5204488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.5205201Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.5205751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.5206632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.5207317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.5207868Z kernel = self.compile( 2025-05-07T20:31:56.5208424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.5209103Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.5209676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.5209915Z 2025-05-07T20:31:56.5210138Z self = 2025-05-07T20:31:56.5211259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.5212755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448572d9e0>} 2025-05-07T20:31:56.5214176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.5215274Z context = 2025-05-07T20:31:56.5215577Z 2025-05-07T20:31:56.5215757Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.5216311Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.5216803Z module_map=module_map) 2025-05-07T20:31:56.5217187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.5217551Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.5217818Z E ^ 2025-05-07T20:31:56.5218306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.5218778Z 2025-05-07T20:31:56.5219218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.5219748Z 2025-05-07T20:31:56.5219855Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.5220291Z self=, 2025-05-07T20:31:56.5220705Z T=2048, 2025-05-07T20:31:56.5220897Z D=7168, 2025-05-07T20:31:56.5221099Z scale_ub=None, 2025-05-07T20:31:56.5221446Z contiguous=True, 2025-05-07T20:31:56.5221679Z compiled=True, 2025-05-07T20:31:56.5221890Z ) 2025-05-07T20:31:56.5222224Z self = 2025-05-07T20:31:56.5222730Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:56.5223017Z 2025-05-07T20:31:56.5223099Z @given( 2025-05-07T20:31:56.5223337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.5223661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.5223970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.5224308Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.5224650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.5224942Z ) 2025-05-07T20:31:56.5225301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.5225759Z def test_silu_mul_quant( 2025-05-07T20:31:56.5226012Z self, 2025-05-07T20:31:56.5226217Z T: int, 2025-05-07T20:31:56.5226420Z D: int, 2025-05-07T20:31:56.5226642Z scale_ub: Optional[float], 2025-05-07T20:31:56.5226920Z contiguous: bool, 2025-05-07T20:31:56.5227169Z compiled: bool, 2025-05-07T20:31:56.5227396Z ) -> None: 2025-05-07T20:31:56.5227614Z torch.manual_seed(2025) 2025-05-07T20:31:56.5227860Z 2025-05-07T20:31:56.5228142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.5228485Z 2025-05-07T20:31:56.5228684Z x_sign = torch.sign(x) 2025-05-07T20:31:56.5228980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.5229294Z x = x_sign * x_clamp 2025-05-07T20:31:56.5229628Z x0 = x[:, :D] 2025-05-07T20:31:56.5229851Z x1 = x[:, D:] 2025-05-07T20:31:56.5230060Z 2025-05-07T20:31:56.5230255Z if contiguous: 2025-05-07T20:31:56.5230495Z x0 = x0.contiguous() 2025-05-07T20:31:56.5230764Z x1 = x1.contiguous() 2025-05-07T20:31:56.5231011Z 2025-05-07T20:31:56.5231209Z if scale_ub is not None: 2025-05-07T20:31:56.5231483Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.5231825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.5232141Z ) 2025-05-07T20:31:56.5232336Z else: 2025-05-07T20:31:56.5232550Z scale_ub_tensor = None 2025-05-07T20:31:56.5232808Z 2025-05-07T20:31:56.5233046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.5233366Z op = silu_mul_quant 2025-05-07T20:31:56.5233621Z if compiled: 2025-05-07T20:31:56.5233876Z op = torch.compile(op) 2025-05-07T20:31:56.5234183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5234463Z 2025-05-07T20:31:56.5234660Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.5234828Z 2025-05-07T20:31:56.5234929Z moe/activation_test.py:117: 2025-05-07T20:31:56.5235239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.5235580Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.5235861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5236435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.5237008Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.5875930Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
[the remaining OOM failures print the same allocator report; they are abbreviated below to the failing statement and the requested size]
2025-05-07T20:31:56.5890210Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 112.00 MiB at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
2025-05-07T20:31:56.5903812Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
2025-05-07T20:31:56.5923783Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 56.00 MiB at x_clamp)
2025-05-07T20:31:56.5937336Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 56.00 MiB at x_sign = torch.sign(x))
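Every allocator report above shows roughly 22 GiB already held on a 22.07 GiB device, so each new example fails on allocations as small as 40 to 56 MiB; the OOMs are cumulative fallout from earlier examples rather than one oversized tensor. A hedged sketch of per-example cleanup that could relieve the pressure, assuming the surrounding unittest-style TestCase; release_cuda_memory is an illustrative helper, not part of the original test file:

    import gc

    import torch


    def release_cuda_memory() -> None:
        # Drop dead Python references first so their blocks become reusable.
        gc.collect()
        # Wait for in-flight kernels before returning blocks to the driver.
        torch.cuda.synchronize()
        # Hand cached-but-unused blocks back so the next example can allocate.
        torch.cuda.empty_cache()


    # e.g. at the top of test_silu_mul_quant:
    #     self.addCleanup(release_cuda_memory)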
2025-05-07T20:31:56.7054018Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError ("type fp8e4nv not supported in this architecture"; same Triton traceback as above. Note that compiled=False fails identically, because silu_mul_quant launches _fbgemm_silu_mul_quant[grid] directly at activation.py:80.)
2025-05-07T20:31:56.7085695Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (fp8e4nv unsupported)
2025-05-07T20:31:56.7784274Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (fp8e4nv unsupported)
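For readers skimming the log, it may help to see what the op under test computes. Below is a rough eager-mode reference reconstructed only from the test's names and call signature; it is an assumption for illustration, not FBGEMM's actual implementation of silu_mul_quant:

    from typing import Optional, Tuple

    import torch


    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: y = SiLU(x0) * x1 in float32, then per-tensor
        # quantization to float8_e4m3fn with an optional upper bound on the
        # value used to derive the scale.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = (amax / fp8_max).clamp(min=1e-12)
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale

Read this way, the traceback pattern is consistent: the fp8 conversion has no pure-PyTorch fallback inside the op, so even compiled=False examples reach the Triton kernel and die in compilation on this GPU.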
2025-05-07T20:31:56.7817956Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 56.00 MiB at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
2025-05-07T20:31:56.8624731Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (fp8e4nv unsupported)
2025-05-07T20:31:56.8663490Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 40.00 MiB at x_sign = torch.sign(x))
2025-05-07T20:31:56.8676639Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 320.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9428040Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 80.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9440739Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9453437Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9466133Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9478729Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB at x = torch.randn(...))
2025-05-07T20:31:57.0534041Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB at x = torch.randn(...))
2025-05-07T20:31:57.0546829Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB at x = torch.randn(...))
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.0559114Z 2025-05-07T20:31:57.0559235Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.0559452Z 2025-05-07T20:31:57.0559558Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.0559980Z self=, 2025-05-07T20:31:57.0560398Z T=16384, 2025-05-07T20:31:57.0560592Z D=7168, 2025-05-07T20:31:57.0560779Z scale_ub=None, 2025-05-07T20:31:57.0560998Z contiguous=True, 2025-05-07T20:31:57.0561229Z compiled=False, 2025-05-07T20:31:57.0561444Z ) 2025-05-07T20:31:57.0561769Z self = 2025-05-07T20:31:57.0562284Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.0562569Z 2025-05-07T20:31:57.0562659Z @given( 2025-05-07T20:31:57.0562888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.0563211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.0563531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.0563877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.0564214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.0571434Z ) 2025-05-07T20:31:57.0571894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.0572359Z def test_silu_mul_quant( 2025-05-07T20:31:57.0572626Z self, 2025-05-07T20:31:57.0572825Z T: int, 2025-05-07T20:31:57.0573027Z D: int, 2025-05-07T20:31:57.0573256Z scale_ub: Optional[float], 2025-05-07T20:31:57.0573529Z contiguous: bool, 2025-05-07T20:31:57.0573772Z compiled: bool, 2025-05-07T20:31:57.0573998Z ) -> None: 2025-05-07T20:31:57.0574321Z torch.manual_seed(2025) 2025-05-07T20:31:57.0574574Z 2025-05-07T20:31:57.0574857Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.0576996Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.0578949Z 2025-05-07T20:31:57.0579076Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.0579295Z 2025-05-07T20:31:57.0579405Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.0579841Z self=, 2025-05-07T20:31:57.0580250Z T=16384, 2025-05-07T20:31:57.0580444Z D=7168, 2025-05-07T20:31:57.0580637Z scale_ub=1200.0, 2025-05-07T20:31:57.0580861Z contiguous=True, 2025-05-07T20:31:57.0581086Z compiled=False, 2025-05-07T20:31:57.0581294Z ) 2025-05-07T20:31:57.0581617Z self = 2025-05-07T20:31:57.0582130Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.0582418Z 2025-05-07T20:31:57.0582502Z @given( 2025-05-07T20:31:57.0582730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.0583051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.0583454Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.0583795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.0584129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.0584419Z ) 2025-05-07T20:31:57.0584787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.0585242Z def test_silu_mul_quant( 2025-05-07T20:31:57.0585488Z self, 2025-05-07T20:31:57.0585685Z T: int, 2025-05-07T20:31:57.0585883Z D: int, 2025-05-07T20:31:57.0586106Z scale_ub: Optional[float], 2025-05-07T20:31:57.0586385Z contiguous: bool, 2025-05-07T20:31:57.0586626Z compiled: bool, 2025-05-07T20:31:57.0586852Z ) -> None: 2025-05-07T20:31:57.0587075Z torch.manual_seed(2025) 2025-05-07T20:31:57.0587322Z 2025-05-07T20:31:57.0587600Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.0589740Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:31:57.0592194Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.1877475Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
[same test source elided]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
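Note: every OutOfMemoryError above is raised while setting up the example, not inside the kernel under test; each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 input while memory from earlier examples is still held by the caching allocator, so the free pool keeps shrinking. A minimal sketch of the two standard mitigations, assuming only public PyTorch APIs; the helper name and where it runs (e.g. setUp/tearDown) are illustrative, not part of this test suite:

    import gc
    import os

    # The allocator reads this at CUDA initialization, so it must be set
    # before the first allocation in the process (the log's own suggestion).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def free_cuda_between_examples() -> None:
        """Drop dead references, then return cached blocks to the driver."""
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()

    # Size check for the failing allocation: a [T, 2 * D] bf16 tensor at
    # T=16384, D=7168 is 16384 * 14336 * 2 bytes = 448 MiB, matching the
    # "Tried to allocate 448.00 MiB" traces above (112 MiB at T=4096,
    # 56 MiB at T=2048).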
2025-05-07T20:31:57.1890253Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test source elided; with compiled=True the call reaches the Triton kernel through torch._dynamo]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[same Triton jit.py/compiler.py chain as above, elided]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.2228843Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[same test source elided]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
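Note: the CompilationError is independent of the OOMs. Triton's fp8e4nv is the e4m3 float8 type these kernels quantize to, and this job's linux.g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), for which this Triton build only offers fp8e4b15 and fp8e5. A guard along the following lines would skip rather than fail the fp8 cases; the (8, 9) cutoff (Ada and newer) is an assumption about Triton's CUDA backend at the time of this log, and the class name is hypothetical:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # e4m3 ("fp8e4nv") conversions are assumed to need sm_89 or newer;
        # the A10G in this job reports (8, 6) and would be skipped.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(gpu_supports_fp8e4nv(), "Triton fp8e4nv needs sm_89+")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical, for illustration
        pass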
2025-05-07T20:31:57.2242283Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test source elided]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

2025-05-07T20:31:57.2255489Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
[same test source elided]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
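Note: the examples above are drawn from a fixed grid, so the same parameter combinations recur across runs. A small self-contained sketch of the sampling space; _MAX_SAMPLES is defined by the test suite and its value is not visible in this log:

    from itertools import product

    # The @given strategies draw from fixed lists; the full cross product is
    # 5 * 2 * 2 * 2 * 2 = 80 combinations, of which Hypothesis tries at most
    # max_examples (_MAX_SAMPLES here) per run.
    GRID = {
        "T": [1, 128, 2048, 4096, 16384],
        "D": [5120, 7168],
        "scale_ub": [None, 1200.00],
        "contiguous": [True, False],
        "compiled": [True, False],
    }
    print(len(list(product(*GRID.values()))))  # 80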
2025-05-07T20:31:57.5830314Z FAILED
2025-05-07T20:31:57.5830718Z =================================== FAILURES ===================================
2025-05-07T20:31:57.5831330Z _____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<...>,
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   [same torch.randn OutOfMemoryError traceback as sub-exception 1]
    | Falsifying example: test_silu_mul_quant(
    |     self=<...>,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   [same torch.randn OutOfMemoryError traceback as sub-exception 1]
    | Falsifying example: test_silu_mul_quant(
    |     self=<...>,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |              ^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=<...>,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
2025-05-07T20:31:57.5915787Z ---------------------------------- Hypothesis ----------------------------------
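Note: each falsifying example above comes with a @reproduce_failure blob that replays exactly that input locally. A minimal sketch of how failure 1 would be replayed; the function name is illustrative, the decorator must sit on top of the same @given stack as the original test, and the blob is only valid under the same Hypothesis version (6.131.14):

    from hypothesis import given, reproduce_failure, strategies as st

    # The blob is copied verbatim from failure 1 above (T=128, D=7168,
    # scale_ub=1200.0, contiguous=True, compiled=False).
    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant_repro(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # original test body from moe/activation_test.py goes here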
2025-05-07T20:31:57.5916173Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [same @given/@settings test source as above, elided down to the failing call]
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[same jit.py / autotuner.py / do_bench / compiler.py chain as sub-exception 4 above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
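Note: ref_fn above is the ground truth the kernel is checked against: SiLU(x0) * x1 in fp32, then rowwise fp8 quantization via triton_quantize_fp8_row, which is itself a Triton kernel and therefore hits the same fp8e4nv error on this GPU. A minimal eager sketch of that reference, assuming e4m3's max normal value of 448 and reading scale_ub as a clamp on the per-row max; this illustrates the contract, it is not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # The fp32 reference from the test's ref_fn: SiLU(x0) * x1.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32

    E4M3_MAX = 448.0  # largest normal value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One dequantization scale per row, so that
        # y ~= y_fp8.to(torch.float32) * scale[:, None], as in the test.
        row_max = y.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(y.dtype))
        scale = row_max / E4M3_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)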
2025-05-07T20:31:57.6000942Z x1 = x[:, D:] 2025-05-07T20:31:57.6001231Z 2025-05-07T20:31:57.6001500Z if contiguous: 2025-05-07T20:31:57.6001829Z x0 = x0.contiguous() 2025-05-07T20:31:57.6002197Z x1 = x1.contiguous() 2025-05-07T20:31:57.6002520Z 2025-05-07T20:31:57.6002786Z if scale_ub is not None: 2025-05-07T20:31:57.6003159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6003609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6004035Z ) 2025-05-07T20:31:57.6004299Z else: 2025-05-07T20:31:57.6004579Z scale_ub_tensor = None 2025-05-07T20:31:57.6004925Z 2025-05-07T20:31:57.6005244Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6005671Z op = silu_mul_quant 2025-05-07T20:31:57.6006009Z if compiled: 2025-05-07T20:31:57.6006618Z op = torch.compile(op) 2025-05-07T20:31:57.6007204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6007557Z 2025-05-07T20:31:57.6007801Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6008005Z 2025-05-07T20:31:57.6008138Z moe/activation_test.py:117: 2025-05-07T20:31:57.6008504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6008919Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6009283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6010193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6011150Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6011983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6012866Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6013719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6014422Z kernel = self.compile( 2025-05-07T20:31:57.6015128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6015984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6016490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6016790Z 2025-05-07T20:31:57.6017043Z self = 2025-05-07T20:31:57.6018532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6020458Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44857e9620>} 2025-05-07T20:31:57.6023989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6025398Z context = 2025-05-07T20:31:57.6025782Z 2025-05-07T20:31:57.6026014Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6026725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6027350Z module_map=module_map) 2025-05-07T20:31:57.6027833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6028317Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6028661Z E ^ 2025-05-07T20:31:57.6029295Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6029925Z 2025-05-07T20:31:57.6030494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6031201Z 2025-05-07T20:31:57.6031345Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6031891Z self=, 2025-05-07T20:31:57.6032435Z T=2048, 2025-05-07T20:31:57.6032685Z D=5120, 2025-05-07T20:31:57.6032933Z scale_ub=1200.0, 2025-05-07T20:31:57.6033232Z contiguous=True, 2025-05-07T20:31:57.6033529Z compiled=True, 2025-05-07T20:31:57.6033790Z ) 2025-05-07T20:31:57.6034215Z self = 2025-05-07T20:31:57.6034887Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6035265Z 2025-05-07T20:31:57.6035384Z @given( 2025-05-07T20:31:57.6035685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6036220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6036642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6037095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6037550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6037944Z ) 2025-05-07T20:31:57.6038420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6039030Z def test_silu_mul_quant( 2025-05-07T20:31:57.6039363Z self, 2025-05-07T20:31:57.6039626Z T: int, 2025-05-07T20:31:57.6039897Z D: int, 2025-05-07T20:31:57.6040185Z scale_ub: Optional[float], 2025-05-07T20:31:57.6040537Z contiguous: bool, 2025-05-07T20:31:57.6040861Z compiled: bool, 2025-05-07T20:31:57.6041152Z ) -> None: 2025-05-07T20:31:57.6041432Z torch.manual_seed(2025) 2025-05-07T20:31:57.6041758Z 2025-05-07T20:31:57.6042134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6042593Z 2025-05-07T20:31:57.6042847Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6043232Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6043651Z x = x_sign * x_clamp 2025-05-07T20:31:57.6043961Z x0 = x[:, :D] 2025-05-07T20:31:57.6044270Z x1 = x[:, D:] 2025-05-07T20:31:57.6044556Z 2025-05-07T20:31:57.6044795Z if contiguous: 2025-05-07T20:31:57.6045101Z x0 = x0.contiguous() 2025-05-07T20:31:57.6045454Z x1 = x1.contiguous() 2025-05-07T20:31:57.6045771Z 2025-05-07T20:31:57.6046030Z if scale_ub is not None: 2025-05-07T20:31:57.6046392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6046827Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6047241Z ) 2025-05-07T20:31:57.6047505Z else: 2025-05-07T20:31:57.6047777Z scale_ub_tensor = None 2025-05-07T20:31:57.6048118Z 2025-05-07T20:31:57.6048422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6048834Z op = silu_mul_quant 2025-05-07T20:31:57.6069694Z if compiled: 2025-05-07T20:31:57.6070215Z op = torch.compile(op) 2025-05-07T20:31:57.6070632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6071003Z 2025-05-07T20:31:57.6071266Z y_fp8, y_scale = fn() 2025-05-07T20:31:57.6071698Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:57.6072079Z 2025-05-07T20:31:57.6072394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6072855Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:57.6073258Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:57.6073675Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:57.6074154Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.6074576Z 2025-05-07T20:31:57.6074830Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:57.6075089Z 2025-05-07T20:31:57.6075223Z moe/activation_test.py:126: 2025-05-07T20:31:57.6075628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6076078Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:57.6076520Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.6077581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:57.6078617Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:57.6079370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6080312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6081273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:57.6082387Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:57.6083329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:57.6084180Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:57.6085015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:57.6085725Z fn() 2025-05-07T20:31:57.6086421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:57.6087222Z self.fn.run( 2025-05-07T20:31:57.6087857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6088588Z kernel = self.compile( 2025-05-07T20:31:57.6089345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6090249Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6090815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6091146Z 2025-05-07T20:31:57.6091428Z self = 2025-05-07T20:31:57.6092898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6094326Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448577e980>} 2025-05-07T20:31:57.6095717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6096773Z context = 2025-05-07T20:31:57.6097167Z 2025-05-07T20:31:57.6097347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6097923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6098408Z module_map=module_map) 2025-05-07T20:31:57.6098784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6099147Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:57.6099409Z E ^ 2025-05-07T20:31:57.6099880Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
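Note on the failure: fp8e4nv is Triton's name for the FP8 E4M3 format (torch.float8_e4m3fn), and Triton only emits it on NVIDIA GPUs with compute capability (8, 9) or newer; on older parts it exposes just fp8e4b15 and fp8e5, which is exactly the supported list quoted in the ValueError above. A minimal guard, using an illustrative helper name that is not part of the test suite, would skip these cases rather than fail the job:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; Triton needs an NVIDIA GPU with
        # compute capability >= (8, 9) (Ada/Hopper) to compile casts to it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for the FP8 tests in this log:
    skip_unless_fp8e4nv = unittest.skipUnless(
        gpu_supports_fp8e4nv(), "FP8 e4m3 (fp8e4nv) needs compute capability >= 8.9"
    )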
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
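The "Trying example" lines are produced by Hypothesis running at Verbosity.verbose: @given draws each parameter from the fixed st.sampled_from grids shown in the listing, and @settings caps how many examples are drawn. A stripped-down sketch of the same pattern, with _MAX_SAMPLES as a stand-in for the module's real value (not shown in this log):

    from hypothesis import Verbosity, given, settings, strategies as st

    _MAX_SAMPLES = 16  # assumption; the actual value lives in the test module

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_parameter_grid(T, D, scale_ub) -> None:
        # Verbosity.verbose prints each "Trying example: ..." line; deadline=None
        # avoids flaky timeouts while Triton autotunes kernels on first call.
        assert T in (1, 128, 2048, 4096, 16384) and D in (5120, 7168)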
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
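What the test checks, per the listing: fn() runs the fused FBGEMM kernel and returns (y_fp8, y_scale), which the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None], while ref_fn() computes the same activation eagerly, SiLU(x0) * x1, before row-wise quantization. The eager activation alone, as a self-contained sketch:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Mirrors ref_fn() above, minus the quantization step:
        # SiLU(x0) * x1 == x0 * sigmoid(x0) * x1, computed in fp32.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32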
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
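triton_quantize_fp8_row is what fails to compile here, but its contract is visible from how the test consumes its output: one FP8 tensor plus one fp32 scale per row, such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None], with scale_ub optionally capping the per-row maximum. An illustrative pure-PyTorch stand-in under those assumed semantics (not FBGEMM's implementation):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: per-row scale chosen so that
        # y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            # Assumption: scale_ub caps the row max used to derive the scale.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        y_scale = row_max / fp8_max
        # Guard all-zero rows against division by zero.
        y_scale = torch.where(y_scale == 0, torch.ones_like(y_scale), y_scale)
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale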
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
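A pattern worth noting across these failures: every compiled=False example dies inside fn() as soon as silu_mul_quant launches _fbgemm_silu_mul_quant eagerly, while the compiled=True examples get through fn() and only fail later in the eager reference's _kernel_quantize_fp8_row, plausibly because the torch.compile-generated code takes a different path than the hand-written eager kernel. The toggle itself is a thin wrapper, sketched below:

    import torch

    def maybe_compile(op, compiled: bool):
        # torch.compile returns a wrapped callable; actual compilation happens
        # lazily on the first invocation, not at wrap time.
        return torch.compile(op) if compiled else op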
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
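To isolate the error outside the test suite, the following sketch (not code from FBGEMM) should reproduce the same CompilationError under the same assumption: any @triton.jit kernel that casts to tl.float8e4nv on a GPU with compute capability below (8, 9).

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what pre-SM-8.9 backends reject at compile time.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)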
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: CompilationError (traceback identical to the one above, raised from _kernel_quantize_fp8_row via ref_fn)
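For readers without the FBGEMM sources handy, here is a rough pure-PyTorch sketch of what the reference path above computes. The row-wise quantization details (scale derivation, the role of scale_ub) are assumptions about triton_quantize_fp8_row's contract, not its actual implementation; quantize_fp8_row_ref and silu_mul_quant_ref are names invented for this sketch:

# Pure-PyTorch sketch of the test's reference path. Assumed semantics:
# triton_quantize_fp8_row rescales each row so its max |value| maps to the
# e4m3 representable max, optionally clamping the row max to scale_ub.
from typing import Optional, Tuple

import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
EPS = 1e-12  # guards the division for all-zero rows


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Assumed: scale_ub caps the per-row max used to derive the scale.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    scale = torch.clamp(row_max, min=EPS) / E4M3_MAX  # per-row dequant scale
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale


def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Mirrors ref_fn in the test: SiLU(x0) * x1 in fp32, then row-wise FP8,
    # so that y_fp8.to(torch.float32) * scale[:, None] recovers y approximately.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
    return quantize_fp8_row_ref(y, scale_ub)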
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError at moe/activation_test.py:126 (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError at moe/activation_test.py:126 (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row): same ValueError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> CompilationError at moe/activation_test.py:117 (fn -> torch._dynamo -> silu_mul_quant -> _fbgemm_silu_mul_quant): same ValueError

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> CompilationError at moe/activation_test.py:126 (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row): same ValueError
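Each "Trying example" entry above and below is a Hypothesis-generated case, and any of them can be pinned for a deterministic local repro. A minimal sketch of that workflow, assuming only that hypothesis is installed (the strategies are copied from the test, the body is elided, and test_repro_silu_mul_quant is a name invented here):

# Repro sketch, assumed workflow (not FBGEMM tooling): force Hypothesis to
# replay one failing parameter set before any randomly drawn examples.
from typing import Optional

import hypothesis.strategies as st
from hypothesis import example, given, settings


@example(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
@settings(max_examples=5, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_repro_silu_mul_quant(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    # Body elided; the real test calls silu_mul_quant and the
    # triton_quantize_fp8_row reference shown earlier in this log.
    assert T in (1, 128, 2048, 4096, 16384)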
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6405312Z 2025-05-07T20:31:57.6405744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6405825Z 2025-05-07T20:31:57.6405929Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6406419Z self=, 2025-05-07T20:31:57.6406546Z T=1, 2025-05-07T20:31:57.6406652Z D=5120, 2025-05-07T20:31:57.6406775Z scale_ub=None, 2025-05-07T20:31:57.6406869Z contiguous=True, 2025-05-07T20:31:57.6406955Z compiled=False, 2025-05-07T20:31:57.6407032Z ) 2025-05-07T20:31:57.6407257Z self = 2025-05-07T20:31:57.6407424Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6407435Z 2025-05-07T20:31:57.6407510Z @given( 2025-05-07T20:31:57.6407633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6407738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6407853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6407978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6408098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6408171Z ) 2025-05-07T20:31:57.6408427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6408526Z def test_silu_mul_quant( 2025-05-07T20:31:57.6408601Z self, 2025-05-07T20:31:57.6408678Z T: int, 2025-05-07T20:31:57.6408763Z D: int, 2025-05-07T20:31:57.6408861Z scale_ub: Optional[float], 2025-05-07T20:31:57.6408954Z contiguous: bool, 2025-05-07T20:31:57.6409038Z compiled: bool, 2025-05-07T20:31:57.6409117Z ) -> None: 2025-05-07T20:31:57.6409218Z torch.manual_seed(2025) 2025-05-07T20:31:57.6409292Z 2025-05-07T20:31:57.6409462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6409541Z 2025-05-07T20:31:57.6409634Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6409764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6409861Z x = x_sign * x_clamp 2025-05-07T20:31:57.6409941Z x0 = x[:, :D] 2025-05-07T20:31:57.6410019Z x1 = x[:, D:] 2025-05-07T20:31:57.6410304Z 2025-05-07T20:31:57.6410393Z if contiguous: 2025-05-07T20:31:57.6410494Z x0 = x0.contiguous() 2025-05-07T20:31:57.6410582Z x1 = x1.contiguous() 2025-05-07T20:31:57.6410654Z 2025-05-07T20:31:57.6410750Z if scale_ub is not None: 2025-05-07T20:31:57.6410854Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6410991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6411071Z ) 2025-05-07T20:31:57.6411147Z else: 2025-05-07T20:31:57.6411240Z scale_ub_tensor = None 2025-05-07T20:31:57.6411322Z 2025-05-07T20:31:57.6411451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6411540Z op = silu_mul_quant 2025-05-07T20:31:57.6411637Z if compiled: 2025-05-07T20:31:57.6411736Z op = torch.compile(op) 2025-05-07T20:31:57.6411960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6412044Z 2025-05-07T20:31:57.6412140Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6412144Z 2025-05-07T20:31:57.6412263Z moe/activation_test.py:117: 2025-05-07T20:31:57.6412413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6412528Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6412636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6413151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6413248Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6413626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6413985Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6414341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6414442Z kernel = self.compile( 2025-05-07T20:31:57.6414838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6415023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6415153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6415157Z 2025-05-07T20:31:57.6415372Z self = 2025-05-07T20:31:57.6416177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6416698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476765ee0>} 2025-05-07T20:31:57.6417484Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6417678Z context = 2025-05-07T20:31:57.6417683Z 2025-05-07T20:31:57.6417859Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6418130Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6418237Z module_map=module_map) 2025-05-07T20:31:57.6418408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6418509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6418597Z E ^ 2025-05-07T20:31:57.6418962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6418966Z 2025-05-07T20:31:57.6419498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6419503Z 2025-05-07T20:31:57.6419612Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6419841Z self=, 2025-05-07T20:31:57.6419926Z T=128, 2025-05-07T20:31:57.6420003Z D=5120, 2025-05-07T20:31:57.6420085Z scale_ub=None, 2025-05-07T20:31:57.6420176Z contiguous=False, 2025-05-07T20:31:57.6420259Z compiled=True, 2025-05-07T20:31:57.6420331Z ) 2025-05-07T20:31:57.6420562Z self = 2025-05-07T20:31:57.6420735Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.6420745Z 2025-05-07T20:31:57.6420821Z @given( 2025-05-07T20:31:57.6420946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6421051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6421166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6421288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6421402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6421500Z ) 2025-05-07T20:31:57.6421782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6421875Z def test_silu_mul_quant( 2025-05-07T20:31:57.6421956Z self, 2025-05-07T20:31:57.6422033Z T: int, 2025-05-07T20:31:57.6422107Z D: int, 2025-05-07T20:31:57.6422211Z scale_ub: Optional[float], 2025-05-07T20:31:57.6422299Z contiguous: bool, 2025-05-07T20:31:57.6422385Z compiled: bool, 2025-05-07T20:31:57.6422551Z ) -> None: 2025-05-07T20:31:57.6422647Z torch.manual_seed(2025) 2025-05-07T20:31:57.6422720Z 2025-05-07T20:31:57.6422901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6422979Z 2025-05-07T20:31:57.6423078Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6423202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6423291Z x = x_sign * x_clamp 2025-05-07T20:31:57.6423377Z x0 = x[:, :D] 2025-05-07T20:31:57.6423455Z x1 = x[:, D:] 2025-05-07T20:31:57.6423527Z 2025-05-07T20:31:57.6423615Z if contiguous: 2025-05-07T20:31:57.6423705Z x0 = x0.contiguous() 2025-05-07T20:31:57.6423795Z x1 = x1.contiguous() 2025-05-07T20:31:57.6423874Z 2025-05-07T20:31:57.6423964Z if scale_ub is not None: 2025-05-07T20:31:57.6424067Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6424212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6424294Z ) 2025-05-07T20:31:57.6424374Z else: 2025-05-07T20:31:57.6424467Z scale_ub_tensor = None 2025-05-07T20:31:57.6424538Z 2025-05-07T20:31:57.6424678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6424768Z op = silu_mul_quant 2025-05-07T20:31:57.6424853Z if compiled: 2025-05-07T20:31:57.6424957Z op = torch.compile(op) 2025-05-07T20:31:57.6425062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6425133Z 2025-05-07T20:31:57.6425233Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6425238Z 2025-05-07T20:31:57.6425335Z moe/activation_test.py:117: 2025-05-07T20:31:57.6425465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6425572Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6425672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6426056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6426155Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6426745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6426851Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6427219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6427452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6427803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6427896Z kernel = self.compile( 2025-05-07T20:31:57.6428296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6428474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6428610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6428614Z 2025-05-07T20:31:57.6428833Z self = 2025-05-07T20:31:57.6429634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6430157Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44767a5da0>} 2025-05-07T20:31:57.6430929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6431208Z context = 2025-05-07T20:31:57.6431212Z 2025-05-07T20:31:57.6431380Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6431657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6431770Z module_map=module_map) 2025-05-07T20:31:57.6431937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6432035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6432122Z E ^ 2025-05-07T20:31:57.6432485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6432490Z 2025-05-07T20:31:57.6432926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6432930Z 2025-05-07T20:31:57.6433034Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6433269Z self=, 2025-05-07T20:31:57.6433353Z T=128, 2025-05-07T20:31:57.6433428Z D=7168, 2025-05-07T20:31:57.6433512Z scale_ub=1200.0, 2025-05-07T20:31:57.6433611Z contiguous=False, 2025-05-07T20:31:57.6433697Z compiled=False, 2025-05-07T20:31:57.6433775Z ) 2025-05-07T20:31:57.6433997Z self = 2025-05-07T20:31:57.6434175Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.6434180Z 2025-05-07T20:31:57.6434262Z @given( 2025-05-07T20:31:57.6434385Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6434484Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6434605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6434722Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6434839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6434921Z ) 2025-05-07T20:31:57.6435171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6435269Z def test_silu_mul_quant( 2025-05-07T20:31:57.6435427Z self, 2025-05-07T20:31:57.6435506Z T: int, 2025-05-07T20:31:57.6435597Z D: int, 2025-05-07T20:31:57.6435695Z scale_ub: Optional[float], 2025-05-07T20:31:57.6435789Z contiguous: bool, 2025-05-07T20:31:57.6435876Z compiled: bool, 2025-05-07T20:31:57.6435954Z ) -> None: 2025-05-07T20:31:57.6436055Z torch.manual_seed(2025) 2025-05-07T20:31:57.6436128Z 2025-05-07T20:31:57.6436299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6436382Z 2025-05-07T20:31:57.6436477Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6436601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6436698Z x = x_sign * x_clamp 2025-05-07T20:31:57.6436783Z x0 = x[:, :D] 2025-05-07T20:31:57.6436870Z x1 = x[:, D:] 2025-05-07T20:31:57.6436942Z 2025-05-07T20:31:57.6437029Z if contiguous: 2025-05-07T20:31:57.6437126Z x0 = x0.contiguous() 2025-05-07T20:31:57.6437221Z x1 = x1.contiguous() 2025-05-07T20:31:57.6437294Z 2025-05-07T20:31:57.6437393Z if scale_ub is not None: 2025-05-07T20:31:57.6437499Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6437635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6437721Z ) 2025-05-07T20:31:57.6437797Z else: 2025-05-07T20:31:57.6437892Z scale_ub_tensor = None 2025-05-07T20:31:57.6437970Z 2025-05-07T20:31:57.6438102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6438192Z op = silu_mul_quant 2025-05-07T20:31:57.6438285Z if compiled: 2025-05-07T20:31:57.6438387Z op = torch.compile(op) 2025-05-07T20:31:57.6438583Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6438657Z 2025-05-07T20:31:57.6438750Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6438754Z 2025-05-07T20:31:57.6438859Z moe/activation_test.py:117: 2025-05-07T20:31:57.6438997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6439099Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6439203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6439714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6439818Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6440185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6440411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6440765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6440864Z kernel = self.compile( 2025-05-07T20:31:57.6441261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6441444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6441574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6441578Z 2025-05-07T20:31:57.6441790Z self = 2025-05-07T20:31:57.6442638Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6443155Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e7ab60>} 2025-05-07T20:31:57.6444019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6444212Z context = 2025-05-07T20:31:57.6444217Z 2025-05-07T20:31:57.6444390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6444661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6444778Z module_map=module_map) 2025-05-07T20:31:57.6444941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6445037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6445125Z E ^ 2025-05-07T20:31:57.6445490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6445499Z 2025-05-07T20:31:57.6445930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6445934Z 2025-05-07T20:31:57.6446044Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6446271Z self=, 2025-05-07T20:31:57.6446354Z T=128, 2025-05-07T20:31:57.6446431Z D=5120, 2025-05-07T20:31:57.6446513Z scale_ub=None, 2025-05-07T20:31:57.6446605Z contiguous=False, 2025-05-07T20:31:57.6446691Z compiled=False, 2025-05-07T20:31:57.6446762Z ) 2025-05-07T20:31:57.6446993Z self = 2025-05-07T20:31:57.6447167Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.6447171Z 2025-05-07T20:31:57.6447246Z @given( 2025-05-07T20:31:57.6447480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6447579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6447703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6447826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6447940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6448020Z ) 2025-05-07T20:31:57.6448270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6448362Z def test_silu_mul_quant( 2025-05-07T20:31:57.6448444Z self, 2025-05-07T20:31:57.6448519Z T: int, 2025-05-07T20:31:57.6448595Z D: int, 2025-05-07T20:31:57.6448704Z scale_ub: Optional[float], 2025-05-07T20:31:57.6448793Z contiguous: bool, 2025-05-07T20:31:57.6448877Z compiled: bool, 2025-05-07T20:31:57.6448962Z ) -> None: 2025-05-07T20:31:57.6449056Z torch.manual_seed(2025) 2025-05-07T20:31:57.6449136Z 2025-05-07T20:31:57.6449315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6449393Z 2025-05-07T20:31:57.6449491Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6449624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6449712Z x = x_sign * x_clamp 2025-05-07T20:31:57.6449799Z x0 = x[:, :D] 2025-05-07T20:31:57.6449880Z x1 = x[:, D:] 2025-05-07T20:31:57.6449951Z 2025-05-07T20:31:57.6450042Z if contiguous: 2025-05-07T20:31:57.6450134Z x0 = x0.contiguous() 2025-05-07T20:31:57.6450223Z x1 = x1.contiguous() 2025-05-07T20:31:57.6450301Z 2025-05-07T20:31:57.6450392Z if scale_ub is not None: 2025-05-07T20:31:57.6450504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6450641Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6450717Z ) 2025-05-07T20:31:57.6450799Z else: 2025-05-07T20:31:57.6450900Z scale_ub_tensor = None 2025-05-07T20:31:57.6450971Z 2025-05-07T20:31:57.6451106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6451196Z op = silu_mul_quant 2025-05-07T20:31:57.6451365Z if compiled: 2025-05-07T20:31:57.6451475Z op = torch.compile(op) 2025-05-07T20:31:57.6451585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6451675Z 2025-05-07T20:31:57.6451857Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6451863Z 2025-05-07T20:31:57.6451964Z moe/activation_test.py:117: 2025-05-07T20:31:57.6452102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6452204Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6452303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6452819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6452920Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6453288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6453528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6453878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6453981Z kernel = self.compile( 2025-05-07T20:31:57.6454377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6454558Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6454694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6454699Z 2025-05-07T20:31:57.6454904Z self = 2025-05-07T20:31:57.6455715Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6456320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e7b060>} 2025-05-07T20:31:57.6457091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6457287Z context = 2025-05-07T20:31:57.6457292Z 2025-05-07T20:31:57.6457460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6457736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6457845Z module_map=module_map) 2025-05-07T20:31:57.6458009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6458113Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6458194Z E ^ 2025-05-07T20:31:57.6458565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6458570Z 2025-05-07T20:31:57.6458996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6459000Z 2025-05-07T20:31:57.6459103Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6459337Z self=, 2025-05-07T20:31:57.6459414Z T=128, 2025-05-07T20:31:57.6459489Z D=5120, 2025-05-07T20:31:57.6459579Z scale_ub=1200.0, 2025-05-07T20:31:57.6459664Z contiguous=True, 2025-05-07T20:31:57.6459754Z compiled=False, 2025-05-07T20:31:57.6459832Z ) 2025-05-07T20:31:57.6460054Z self = 2025-05-07T20:31:57.6460235Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.6460318Z 2025-05-07T20:31:57.6460395Z @given( 2025-05-07T20:31:57.6460515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6460622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6460737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6460855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6460975Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6461048Z ) 2025-05-07T20:31:57.6461304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6461412Z def test_silu_mul_quant( 2025-05-07T20:31:57.6461497Z self, 2025-05-07T20:31:57.6461596Z T: int, 2025-05-07T20:31:57.6461688Z D: int, 2025-05-07T20:31:57.6461786Z scale_ub: Optional[float], 2025-05-07T20:31:57.6461881Z contiguous: bool, 2025-05-07T20:31:57.6461967Z compiled: bool, 2025-05-07T20:31:57.6462045Z ) -> None: 2025-05-07T20:31:57.6462151Z torch.manual_seed(2025) 2025-05-07T20:31:57.6462223Z 2025-05-07T20:31:57.6462396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6462476Z 2025-05-07T20:31:57.6462570Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6462699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6462791Z x = x_sign * x_clamp 2025-05-07T20:31:57.6462870Z x0 = x[:, :D] 2025-05-07T20:31:57.6462954Z x1 = x[:, D:] 2025-05-07T20:31:57.6463027Z 2025-05-07T20:31:57.6463110Z if contiguous: 2025-05-07T20:31:57.6463208Z x0 = x0.contiguous() 2025-05-07T20:31:57.6463297Z x1 = x1.contiguous() 2025-05-07T20:31:57.6463452Z 2025-05-07T20:31:57.6463550Z if scale_ub is not None: 2025-05-07T20:31:57.6463658Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6463794Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6463882Z ) 2025-05-07T20:31:57.6463958Z else: 2025-05-07T20:31:57.6464051Z scale_ub_tensor = None 2025-05-07T20:31:57.6464129Z 2025-05-07T20:31:57.6464258Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6464355Z op = silu_mul_quant 2025-05-07T20:31:57.6464439Z if compiled: 2025-05-07T20:31:57.6464538Z op = torch.compile(op) 2025-05-07T20:31:57.6464652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6464724Z 2025-05-07T20:31:57.6464815Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6464819Z 2025-05-07T20:31:57.6464924Z moe/activation_test.py:117: 2025-05-07T20:31:57.6465056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6465164Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6465269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6465787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6465891Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6466258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6466484Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6466840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6466935Z kernel = self.compile( 2025-05-07T20:31:57.6467337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6467515Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6467651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6467656Z 2025-05-07T20:31:57.6467951Z self = 2025-05-07T20:31:57.6468756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6469279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e78180>} 2025-05-07T20:31:57.6470050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6470247Z context = 2025-05-07T20:31:57.6470251Z 2025-05-07T20:31:57.6470427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6470702Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6470815Z module_map=module_map) 2025-05-07T20:31:57.6470981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6471081Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6471166Z E ^ 2025-05-07T20:31:57.6471530Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
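Every example in this run fails at the same point: the Triton kernels request the fp8e4nv (e4m3) dtype, which this GPU's architecture does not support; only fp8e4b15 and fp8e5 are available to it. A minimal sketch of a capability gate that would skip these cases up front, assuming unittest-style tests and assuming fp8e4nv needs compute capability (8, 9) or newer, which matches the error text here; the helper and class names are hypothetical, not taken from this log:

import unittest

import torch


def _device_supports_fp8e4nv() -> bool:
    # Assumption: Triton accepts fp8e4nv (e4m3) only on compute capability
    # >= (8, 9) (Ada / Hopper); older devices expose only fp8e4b15 / fp8e5,
    # which is exactly the ValueError this log repeats.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_device_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
class Fp8ActivationTests(unittest.TestCase):
    # The real test_silu_mul_quant body would live here unchanged; the
    # decorator alone would turn the repeated CompilationErrors below into
    # a single skip.
    pass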
Hypothesis then retried the call with ten more parameter sets. Each retry re-printed the identical test source and the identical traceback (return op(x0, x1, scale_ub_tensor) at moe/activation_test.py:115, _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, CompilationError reported from triton/compiler/compiler.py:100) and failed at y_fp8, y_scale = fn() (moe/activation_test.py:117) with the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), except where noted:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    (here fn() returned, and the identical ValueError was raised from the reference path instead: y_fp8_ref, y_scale_ref = ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row, fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 -> _kernel_quantize_fp8_row, via the Triton autotuner's do_bench)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
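For reference, ref_fn in the listing above computes y = x0 * sigmoid(x0) * x1 in fp32 and then quantizes it rowwise to fp8. A rough pure-PyTorch equivalent of that rowwise quantization, as a sketch only: the function name, the eps, and the clamp-to-scale_ub semantics are assumptions, not FBGEMM's actual triton_quantize_fp8_row:

import torch


def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row scale chosen so the row's max magnitude maps to the fp8 e4m3
    # max value; dequantization is y ~= y_fp8.to(torch.float32) * scale[:, None],
    # matching how the test consumes (y_fp8, y_scale).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        # Assumed semantics: scale_ub caps the per-row max before scaling.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale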
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6617018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
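Every rejection in this run is the same compile-time error: the Triton kernel behind silu_mul_quant requests the fp8e4nv encoding (float8_e4m3fn), which requires an SM 8.9+ GPU (Ada or Hopper), while the A10G backing this g5 runner is SM 8.6 and only exposes the fp8e4b15 and fp8e5 encodings named in the message. A minimal sketch of a capability guard that would skip these examples on such hardware follows; the helper name and skip placement are assumptions for illustration, not FBGEMM's actual gating.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) requires compute capability >= 8.9 (Ada/Hopper);
    # an A10G reports (8, 6), which is why Triton raises the ValueError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class GuardedActivationTest(unittest.TestCase):
    # Hypothetical guard, not FBGEMM's actual skip condition.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    def test_silu_mul_quant_guarded(self) -> None:
        # On capable hardware the fp8 path would run here; this only
        # asserts that the guard admitted a suitable device.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))


if __name__ == "__main__":
    unittest.main()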
[Hypothesis went on to try the ten examples below; each failed with the identical CompilationError, so the repeated test source and Triton traceback are elided.]
2025-05-07T20:31:57.6617128Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.6637138Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:57.6650219Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:57.6663773Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:57.6676740Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:57.6689729Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:57.6703485Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:57.6717137Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.6730578Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.6744198Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
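For orientation: judging by its name and the (y_fp8, y_scale) pair the test unpacks, silu_mul_quant fuses a SiLU gate with fp8 quantization. A hedged bf16 reference of the pre-quantization math, assumed from the op name rather than taken from FBGEMM's kernel:

import torch


def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1; the fused kernel additionally quantizes this product
    # to fp8 and returns a scale, optionally clamped by scale_ub.
    return torch.nn.functional.silu(x0) * x1


# Shapes mirror the test: x is [T, 2*D] and is split into two D-wide halves.
x = torch.randn([4, 16], dtype=torch.bfloat16)
y = silu_mul_reference(x[:, :8], x[:, 8:])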
2025-05-07T20:31:57.6750858Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.6751225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:57.6751457Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.6751887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.6751994Z     kernel = self.compile(
2025-05-07T20:31:57.6752435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.6752614Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.6752755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.6752760Z
2025-05-07T20:31:57.6752968Z self =
2025-05-07T20:31:57.6753771Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.6754297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44a9babd80>}
2025-05-07T20:31:57.6755072Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.6755274Z context =
2025-05-07T20:31:57.6755279Z
2025-05-07T20:31:57.6755445Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.6755716Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.6755830Z                            module_map=module_map)
2025-05-07T20:31:57.6755994Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.6756116Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.6756201Z E       ^
2025-05-07T20:31:57.6756567Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.6756571Z
2025-05-07T20:31:57.6757090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.6757095Z
2025-05-07T20:31:57.6757200Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:57.6763824Z     self=,
2025-05-07T20:31:57.6763922Z     T=16384,
2025-05-07T20:31:57.6764012Z     D=5120,
2025-05-07T20:31:57.6764097Z     scale_ub=1200.0,
2025-05-07T20:31:57.6764182Z     contiguous=True,
2025-05-07T20:31:57.6764273Z     compiled=True,
2025-05-07T20:31:57.6764347Z )
2025-05-07T20:31:57.6764589Z self =
2025-05-07T20:31:57.6764772Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:57.6764789Z
2025-05-07T20:31:57.6764871Z     @given(
2025-05-07T20:31:57.6765002Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:57.6765104Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:57.6765227Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:57.6765357Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:57.6765478Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:57.6765564Z     )
2025-05-07T20:31:57.6765825Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:57.6765922Z     def test_silu_mul_quant(
2025-05-07T20:31:57.6766009Z         self,
2025-05-07T20:31:57.6766091Z         T: int,
2025-05-07T20:31:57.6766170Z         D: int,
2025-05-07T20:31:57.6766279Z         scale_ub: Optional[float],
2025-05-07T20:31:57.6766373Z         contiguous: bool,
2025-05-07T20:31:57.6766460Z         compiled: bool,
2025-05-07T20:31:57.6766550Z     ) -> None:
2025-05-07T20:31:57.6766768Z         torch.manual_seed(2025)
2025-05-07T20:31:57.6766847Z
2025-05-07T20:31:57.6767035Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:57.6767110Z
2025-05-07T20:31:57.6767211Z         x_sign = torch.sign(x)
2025-05-07T20:31:57.6767349Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:57.6767441Z         x = x_sign * x_clamp
2025-05-07T20:31:57.6767533Z         x0 = x[:, :D]
2025-05-07T20:31:57.6767615Z         x1 = x[:, D:]
2025-05-07T20:31:57.6767691Z
2025-05-07T20:31:57.6767789Z         if contiguous:
2025-05-07T20:31:57.6767884Z             x0 = x0.contiguous()
2025-05-07T20:31:57.6767980Z             x1 = x1.contiguous()
2025-05-07T20:31:57.6768067Z
2025-05-07T20:31:57.6768161Z         if scale_ub is not None:
2025-05-07T20:31:57.6768269Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:57.6768414Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:57.6768496Z             )
2025-05-07T20:31:57.6768579Z         else:
2025-05-07T20:31:57.6768676Z             scale_ub_tensor = None
2025-05-07T20:31:57.6768748Z
2025-05-07T20:31:57.6768889Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.6768989Z             op = silu_mul_quant
2025-05-07T20:31:57.6769076Z             if compiled:
2025-05-07T20:31:57.6769185Z                 op = torch.compile(op)
2025-05-07T20:31:57.6769295Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.6769368Z
2025-05-07T20:31:57.6769469Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.6769474Z
2025-05-07T20:31:57.6769577Z moe/activation_test.py:117:
2025-05-07T20:31:57.6769721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.6769823Z moe/activation_test.py:115: in fn
2025-05-07T20:31:57.6769928Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.6770323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:57.6770422Z     return fn(*args, **kwargs)
2025-05-07T20:31:57.6771020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.6771124Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.6771496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:57.6771733Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.6772204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.6772300Z     kernel = self.compile(
2025-05-07T20:31:57.6772706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.6772888Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.6773031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.6773035Z
2025-05-07T20:31:57.6773246Z self =
2025-05-07T20:31:57.6774069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.6774601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4484d772e0>}
2025-05-07T20:31:57.6775385Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.6775587Z context =
2025-05-07T20:31:57.6775672Z
2025-05-07T20:31:57.6775843Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.6776125Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.6776239Z                            module_map=module_map)
2025-05-07T20:31:57.6776405Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.6776509Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.6776586Z E       ^
2025-05-07T20:31:57.6776957Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.6776962Z
2025-05-07T20:31:57.6777402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
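Diagnosis: every example fails at the same point, while Triton is lowering _fbgemm_silu_mul_quant, before the kernel ever launches. The kernel requests the fp8e4nv element type (Triton's name for the float8_e4m3fn encoding), but the GPU driving this job only exposes the fp8e4b15 and fp8e5 encodings; fp8e4nv requires an NVIDIA device with compute capability 8.9 or newer. Below is a minimal sketch, not FBGEMM's actual code, of a capability guard that would skip such tests instead of failing them on older parts; the helper cuda_supports_fp8e4nv and the class name ActivationTests are assumed names for illustration.

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (torch.float8_e4m3fn) is only
    # lowered on NVIDIA GPUs with compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class ActivationTests(unittest.TestCase):  # class name assumed
    @unittest.skipIf(
        not cuda_supports_fp8e4nv(),
        "fp8e4nv needs SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # Hypothesis-driven body as shown in the log above

Guarded this way, the fp8 cases would be reported as skips on sm_86-class runners rather than hard CompilationErrors, while still running on SM 8.9+ CI hardware.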
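For reference while reading the test: judging from how the test drives it, silu_mul_quant fuses a SwiGLU-style activation with FP8 quantization, returning the quantized tensor together with a dequantization scale. The eager sketch below is inferred from that usage only; in particular the rowwise scaling and the way scale_ub caps the per-row maximum are assumptions, not the FBGEMM kernel's documented behavior.

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in float32, standing in for the fused Triton kernel.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Assumed rowwise scheme: one scale per row of the [T, D] output.
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = row_max / FP8_MAX  # dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Note that the plain .to(torch.float8_e4m3fn) cast is done by PyTorch itself and should work even on GPUs where Triton cannot compile fp8e4nv kernels, so a reference like this can serve as a comparison point on the failing hardware.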
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6776962Z 2025-05-07T20:31:57.6777402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6777406Z 2025-05-07T20:31:57.6777509Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6777744Z self=, 2025-05-07T20:31:57.6777826Z T=16384, 2025-05-07T20:31:57.6777901Z D=5120, 2025-05-07T20:31:57.6777988Z scale_ub=None, 2025-05-07T20:31:57.6778074Z contiguous=False, 2025-05-07T20:31:57.6778160Z compiled=True, 2025-05-07T20:31:57.6778241Z ) 2025-05-07T20:31:57.6778466Z self = 2025-05-07T20:31:57.6778647Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.6778651Z 2025-05-07T20:31:57.6778733Z @given( 2025-05-07T20:31:57.6778852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6778956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6779073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6779191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6779311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6779390Z ) 2025-05-07T20:31:57.6779643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6779742Z def test_silu_mul_quant( 2025-05-07T20:31:57.6779819Z self, 2025-05-07T20:31:57.6780011Z T: int, 2025-05-07T20:31:57.6780095Z D: int, 2025-05-07T20:31:57.6780192Z scale_ub: Optional[float], 2025-05-07T20:31:57.6780281Z contiguous: bool, 2025-05-07T20:31:57.6780372Z compiled: bool, 2025-05-07T20:31:57.6780450Z ) -> None: 2025-05-07T20:31:57.6780549Z torch.manual_seed(2025) 2025-05-07T20:31:57.6780628Z 2025-05-07T20:31:57.6780798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6780876Z 2025-05-07T20:31:57.6780966Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6781092Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6781184Z x = x_sign * x_clamp 2025-05-07T20:31:57.6781268Z x0 = x[:, :D] 2025-05-07T20:31:57.6781354Z x1 = x[:, D:] 2025-05-07T20:31:57.6781430Z 2025-05-07T20:31:57.6781514Z if contiguous: 2025-05-07T20:31:57.6781605Z x0 = x0.contiguous() 2025-05-07T20:31:57.6781699Z x1 = x1.contiguous() 2025-05-07T20:31:57.6781777Z 2025-05-07T20:31:57.6781867Z if scale_ub is not None: 2025-05-07T20:31:57.6781977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6782115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6782195Z ) 2025-05-07T20:31:57.6782270Z else: 2025-05-07T20:31:57.6782363Z scale_ub_tensor = None 2025-05-07T20:31:57.6782440Z 2025-05-07T20:31:57.6782570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6782660Z op = silu_mul_quant 2025-05-07T20:31:57.6782752Z if compiled: 2025-05-07T20:31:57.6782852Z op = torch.compile(op) 2025-05-07T20:31:57.6782958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6783122Z 2025-05-07T20:31:57.6783214Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6783218Z 2025-05-07T20:31:57.6783322Z moe/activation_test.py:117: 2025-05-07T20:31:57.6783460Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6783563Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6783673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6784052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6784146Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6784666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6784763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6785141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6785375Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6785731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6785837Z kernel = self.compile( 2025-05-07T20:31:57.6786234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6786414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6786557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6786561Z 2025-05-07T20:31:57.6786770Z self = 2025-05-07T20:31:57.6787590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6788118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608f920>} 2025-05-07T20:31:57.6788983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6789180Z context = 2025-05-07T20:31:57.6789185Z 2025-05-07T20:31:57.6789356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6789637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6789745Z module_map=module_map) 2025-05-07T20:31:57.6789918Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6790023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6790101Z E ^ 2025-05-07T20:31:57.6790479Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6790488Z 2025-05-07T20:31:57.6790919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6790924Z 2025-05-07T20:31:57.6791029Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6791266Z self=, 2025-05-07T20:31:57.6791343Z T=2048, 2025-05-07T20:31:57.6791426Z D=5120, 2025-05-07T20:31:57.6791508Z scale_ub=None, 2025-05-07T20:31:57.6791592Z contiguous=False, 2025-05-07T20:31:57.6791681Z compiled=True, 2025-05-07T20:31:57.6791752Z ) 2025-05-07T20:31:57.6791979Z self = 2025-05-07T20:31:57.6792165Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.6792247Z 2025-05-07T20:31:57.6792323Z @given( 2025-05-07T20:31:57.6792452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6792581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6792721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6792845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6792960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6793031Z ) 2025-05-07T20:31:57.6793292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6793386Z def test_silu_mul_quant( 2025-05-07T20:31:57.6793460Z self, 2025-05-07T20:31:57.6793543Z T: int, 2025-05-07T20:31:57.6793620Z D: int, 2025-05-07T20:31:57.6793721Z scale_ub: Optional[float], 2025-05-07T20:31:57.6793818Z contiguous: bool, 2025-05-07T20:31:57.6793902Z compiled: bool, 2025-05-07T20:31:57.6793987Z ) -> None: 2025-05-07T20:31:57.6794087Z torch.manual_seed(2025) 2025-05-07T20:31:57.6794159Z 2025-05-07T20:31:57.6794340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6794418Z 2025-05-07T20:31:57.6794510Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6794641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6794729Z x = x_sign * x_clamp 2025-05-07T20:31:57.6794808Z x0 = x[:, :D] 2025-05-07T20:31:57.6794894Z x1 = x[:, D:] 2025-05-07T20:31:57.6794967Z 2025-05-07T20:31:57.6795051Z if contiguous: 2025-05-07T20:31:57.6795149Z x0 = x0.contiguous() 2025-05-07T20:31:57.6795238Z x1 = x1.contiguous() 2025-05-07T20:31:57.6795309Z 2025-05-07T20:31:57.6795403Z if scale_ub is not None: 2025-05-07T20:31:57.6795509Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6795653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6795732Z ) 2025-05-07T20:31:57.6795811Z else: 2025-05-07T20:31:57.6795909Z scale_ub_tensor = None 2025-05-07T20:31:57.6795981Z 2025-05-07T20:31:57.6796198Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6796298Z op = silu_mul_quant 2025-05-07T20:31:57.6796383Z if compiled: 2025-05-07T20:31:57.6796484Z op = torch.compile(op) 2025-05-07T20:31:57.6796597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6796669Z 2025-05-07T20:31:57.6796760Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6796764Z 2025-05-07T20:31:57.6796867Z moe/activation_test.py:117: 2025-05-07T20:31:57.6797000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6797105Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6797208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6797586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6797691Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6798208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6798305Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6798682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6798912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6799272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6799366Z kernel = self.compile( 2025-05-07T20:31:57.6799762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6800035Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6800166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6800171Z 2025-05-07T20:31:57.6800391Z self = 2025-05-07T20:31:57.6801202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6801721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608df80>} 2025-05-07T20:31:57.6802540Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6802757Z context = 2025-05-07T20:31:57.6802762Z 2025-05-07T20:31:57.6802937Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6803214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6803322Z module_map=module_map) 2025-05-07T20:31:57.6803495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6803593Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6803678Z E ^ 2025-05-07T20:31:57.6804045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6804050Z 2025-05-07T20:31:57.6804480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6804485Z 2025-05-07T20:31:57.6804598Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6804828Z self=, 2025-05-07T20:31:57.6804908Z T=2048, 2025-05-07T20:31:57.6804990Z D=5120, 2025-05-07T20:31:57.6805156Z scale_ub=1200.0, 2025-05-07T20:31:57.6805252Z contiguous=False, 2025-05-07T20:31:57.6805336Z compiled=True, 2025-05-07T20:31:57.6805409Z ) 2025-05-07T20:31:57.6805641Z self = 2025-05-07T20:31:57.6805826Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6805830Z 2025-05-07T20:31:57.6805906Z @given( 2025-05-07T20:31:57.6806034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6806137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6807132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6807284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6807416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6807497Z ) 2025-05-07T20:31:57.6807757Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6807849Z def test_silu_mul_quant( 2025-05-07T20:31:57.6807938Z self, 2025-05-07T20:31:57.6808017Z T: int, 2025-05-07T20:31:57.6808093Z D: int, 2025-05-07T20:31:57.6808199Z scale_ub: Optional[float], 2025-05-07T20:31:57.6808289Z contiguous: bool, 2025-05-07T20:31:57.6808376Z compiled: bool, 2025-05-07T20:31:57.6808467Z ) -> None: 2025-05-07T20:31:57.6808561Z torch.manual_seed(2025) 2025-05-07T20:31:57.6808632Z 2025-05-07T20:31:57.6808809Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6808883Z 2025-05-07T20:31:57.6808980Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6809105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6809194Z x = x_sign * x_clamp 2025-05-07T20:31:57.6809592Z x0 = x[:, :D] 2025-05-07T20:31:57.6809672Z x1 = x[:, D:] 2025-05-07T20:31:57.6809744Z 2025-05-07T20:31:57.6809834Z if contiguous: 2025-05-07T20:31:57.6809931Z x0 = x0.contiguous() 2025-05-07T20:31:57.6810021Z x1 = x1.contiguous() 2025-05-07T20:31:57.6810101Z 2025-05-07T20:31:57.6810191Z if scale_ub is not None: 2025-05-07T20:31:57.6810298Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6810441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6810516Z ) 2025-05-07T20:31:57.6810591Z else: 2025-05-07T20:31:57.6810692Z scale_ub_tensor = None 2025-05-07T20:31:57.6810765Z 2025-05-07T20:31:57.6810902Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6810993Z op = silu_mul_quant 2025-05-07T20:31:57.6811076Z if compiled: 2025-05-07T20:31:57.6811183Z op = torch.compile(op) 2025-05-07T20:31:57.6811295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6811367Z 2025-05-07T20:31:57.6811466Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6811471Z 2025-05-07T20:31:57.6811575Z moe/activation_test.py:117: 2025-05-07T20:31:57.6811708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6811916Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6812019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6812407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6812502Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6813015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6813120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6813490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6813725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6814234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6814330Z kernel = self.compile( 2025-05-07T20:31:57.6814731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6814909Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6815040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6815045Z 2025-05-07T20:31:57.6815258Z self = 2025-05-07T20:31:57.6816062Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6816595Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fdca40>} 2025-05-07T20:31:57.6817371Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6817568Z context = 2025-05-07T20:31:57.6817573Z 2025-05-07T20:31:57.6817740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6818010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6818122Z module_map=module_map) 2025-05-07T20:31:57.6818286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6818498Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6818583Z E ^ 2025-05-07T20:31:57.6818953Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6818958Z 2025-05-07T20:31:57.6819393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6819398Z 2025-05-07T20:31:57.6819501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6819732Z self=, 2025-05-07T20:31:57.6819815Z T=4096, 2025-05-07T20:31:57.6819892Z D=5120, 2025-05-07T20:31:57.6819976Z scale_ub=1200.0, 2025-05-07T20:31:57.6820072Z contiguous=True, 2025-05-07T20:31:57.6820154Z compiled=True, 2025-05-07T20:31:57.6820239Z ) 2025-05-07T20:31:57.6820462Z self = 2025-05-07T20:31:57.6820647Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6820652Z 2025-05-07T20:31:57.6820736Z @given( 2025-05-07T20:31:57.6820862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6820961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6821084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6821201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6821316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6821398Z ) 2025-05-07T20:31:57.6821651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6821750Z def test_silu_mul_quant( 2025-05-07T20:31:57.6821828Z self, 2025-05-07T20:31:57.6821906Z T: int, 2025-05-07T20:31:57.6821992Z D: int, 2025-05-07T20:31:57.6822091Z scale_ub: Optional[float], 2025-05-07T20:31:57.6822186Z contiguous: bool, 2025-05-07T20:31:57.6822278Z compiled: bool, 2025-05-07T20:31:57.6822361Z ) -> None: 2025-05-07T20:31:57.6822455Z torch.manual_seed(2025) 2025-05-07T20:31:57.6822538Z 2025-05-07T20:31:57.6822851Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6822928Z 2025-05-07T20:31:57.6823028Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6823153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6823248Z x = x_sign * x_clamp 2025-05-07T20:31:57.6823328Z x0 = x[:, :D] 2025-05-07T20:31:57.6823407Z x1 = x[:, D:] 2025-05-07T20:31:57.6823483Z 2025-05-07T20:31:57.6823566Z if contiguous: 2025-05-07T20:31:57.6823657Z x0 = x0.contiguous() 2025-05-07T20:31:57.6823754Z x1 = x1.contiguous() 2025-05-07T20:31:57.6823826Z 2025-05-07T20:31:57.6823936Z if scale_ub is not None: 2025-05-07T20:31:57.6824041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6824182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6824265Z ) 2025-05-07T20:31:57.6824341Z else: 2025-05-07T20:31:57.6824440Z scale_ub_tensor = None 2025-05-07T20:31:57.6824521Z 2025-05-07T20:31:57.6824652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6824743Z op = silu_mul_quant 2025-05-07T20:31:57.6824835Z if compiled: 2025-05-07T20:31:57.6824934Z op = torch.compile(op) 2025-05-07T20:31:57.6825051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6825123Z 2025-05-07T20:31:57.6825217Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6825221Z 2025-05-07T20:31:57.6825325Z moe/activation_test.py:117: 2025-05-07T20:31:57.6825457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6825558Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6825748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6826126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6826219Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6826744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6826840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6827212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6827439Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6827790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6827890Z kernel = self.compile( 2025-05-07T20:31:57.6828285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6828473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6828603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6828613Z 2025-05-07T20:31:57.6828819Z self = 2025-05-07T20:31:57.6829631Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6830147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fde2a0>} 2025-05-07T20:31:57.6830929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6831128Z context = 2025-05-07T20:31:57.6831133Z 2025-05-07T20:31:57.6831377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6831665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6831792Z module_map=module_map) 2025-05-07T20:31:57.6831986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6832085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6832161Z E ^ 2025-05-07T20:31:57.6832537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6832542Z 2025-05-07T20:31:57.6832974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6832983Z 2025-05-07T20:31:57.6833092Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6833323Z self=, 2025-05-07T20:31:57.6833399Z T=128, 2025-05-07T20:31:57.6833485Z D=5120, 2025-05-07T20:31:57.6833569Z scale_ub=1200.0, 2025-05-07T20:31:57.6833653Z contiguous=False, 2025-05-07T20:31:57.6833741Z compiled=True, 2025-05-07T20:31:57.6833813Z ) 2025-05-07T20:31:57.6834038Z self = 2025-05-07T20:31:57.6834220Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6834226Z 2025-05-07T20:31:57.6834302Z @given( 2025-05-07T20:31:57.6834428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6834528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6834644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6834767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6834963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6835037Z ) 2025-05-07T20:31:57.6835301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6835395Z def test_silu_mul_quant( 2025-05-07T20:31:57.6835472Z self, 2025-05-07T20:31:57.6835553Z T: int, 2025-05-07T20:31:57.6835629Z D: int, 2025-05-07T20:31:57.6835734Z scale_ub: Optional[float], 2025-05-07T20:31:57.6835823Z contiguous: bool, 2025-05-07T20:31:57.6835909Z compiled: bool, 2025-05-07T20:31:57.6835994Z ) -> None: 2025-05-07T20:31:57.6836088Z torch.manual_seed(2025) 2025-05-07T20:31:57.6836160Z 2025-05-07T20:31:57.6836336Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6836410Z 2025-05-07T20:31:57.6836502Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6836633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6836730Z x = x_sign * x_clamp 2025-05-07T20:31:57.6836810Z x0 = x[:, :D] 2025-05-07T20:31:57.6836896Z x1 = x[:, D:] 2025-05-07T20:31:57.6836969Z 2025-05-07T20:31:57.6837056Z if contiguous: 2025-05-07T20:31:57.6837155Z x0 = x0.contiguous() 2025-05-07T20:31:57.6837244Z x1 = x1.contiguous() 2025-05-07T20:31:57.6837323Z 2025-05-07T20:31:57.6837413Z if scale_ub is not None: 2025-05-07T20:31:57.6837518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6837665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6837740Z ) 2025-05-07T20:31:57.6837818Z else: 2025-05-07T20:31:57.6837919Z scale_ub_tensor = None 2025-05-07T20:31:57.6837990Z 2025-05-07T20:31:57.6838121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6838221Z op = silu_mul_quant 2025-05-07T20:31:57.6838306Z if compiled: 2025-05-07T20:31:57.6838410Z op = torch.compile(op) 2025-05-07T20:31:57.6838522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6838593Z 2025-05-07T20:31:57.6838692Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6838777Z 2025-05-07T20:31:57.6838878Z moe/activation_test.py:117: 2025-05-07T20:31:57.6839009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6839117Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6839217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6839597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6839696Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6840216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6840321Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6840697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6840927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6841293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6841390Z kernel = self.compile( 2025-05-07T20:31:57.6841791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6841976Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6842105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6842110Z 2025-05-07T20:31:57.6842325Z self = 2025-05-07T20:31:57.6843146Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6843760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44779540e0>} 2025-05-07T20:31:57.6844541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6844733Z context = 2025-05-07T20:31:57.6844738Z 2025-05-07T20:31:57.6844912Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6845187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6845302Z module_map=module_map) 2025-05-07T20:31:57.6845472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6845572Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6845659Z E ^ 2025-05-07T20:31:57.6846030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6846034Z 2025-05-07T20:31:57.6846462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6846475Z 2025-05-07T20:31:57.6846577Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6846807Z self=, 2025-05-07T20:31:57.6846892Z T=16384, 2025-05-07T20:31:57.6846969Z D=7168, 2025-05-07T20:31:57.6847052Z scale_ub=1200.0, 2025-05-07T20:31:57.6847144Z contiguous=True, 2025-05-07T20:31:57.6847226Z compiled=True, 2025-05-07T20:31:57.6847304Z ) 2025-05-07T20:31:57.6847536Z self = 2025-05-07T20:31:57.6847716Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6847721Z 2025-05-07T20:31:57.6847906Z @given( 2025-05-07T20:31:57.6848034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6848133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6848253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6848369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6848482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6848561Z ) 2025-05-07T20:31:57.6848812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6848905Z def test_silu_mul_quant( 2025-05-07T20:31:57.6848988Z self, 2025-05-07T20:31:57.6849065Z T: int, 2025-05-07T20:31:57.6849141Z D: int, 2025-05-07T20:31:57.6849249Z scale_ub: Optional[float], 2025-05-07T20:31:57.6849337Z contiguous: bool, 2025-05-07T20:31:57.6849423Z compiled: bool, 2025-05-07T20:31:57.6849508Z ) -> None: 2025-05-07T20:31:57.6849606Z torch.manual_seed(2025) 2025-05-07T20:31:57.6849684Z 2025-05-07T20:31:57.6849852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6849924Z 2025-05-07T20:31:57.6850022Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6850146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6850234Z x = x_sign * x_clamp 2025-05-07T20:31:57.6850320Z x0 = x[:, :D] 2025-05-07T20:31:57.6850399Z x1 = x[:, D:] 2025-05-07T20:31:57.6850472Z 2025-05-07T20:31:57.6850563Z if contiguous: 2025-05-07T20:31:57.6850655Z x0 = x0.contiguous() 2025-05-07T20:31:57.6850744Z x1 = x1.contiguous() 2025-05-07T20:31:57.6850821Z 2025-05-07T20:31:57.6850911Z if scale_ub is not None: 2025-05-07T20:31:57.6851103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6851239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6851312Z ) 2025-05-07T20:31:57.6851398Z else: 2025-05-07T20:31:57.6851492Z scale_ub_tensor = None 2025-05-07T20:31:57.6851564Z 2025-05-07T20:31:57.6851698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6851877Z op = silu_mul_quant 2025-05-07T20:31:57.6851964Z if compiled: 2025-05-07T20:31:57.6852071Z op = torch.compile(op) 2025-05-07T20:31:57.6852175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6852246Z 2025-05-07T20:31:57.6852341Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6852346Z 2025-05-07T20:31:57.6852443Z moe/activation_test.py:117: 2025-05-07T20:31:57.6852580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6852690Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6852789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6853170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6853267Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6853777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6853881Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6854248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6854483Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6854835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6854930Z kernel = self.compile( 2025-05-07T20:31:57.6855337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6855517Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6855734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6855747Z 2025-05-07T20:31:57.6855955Z self = 2025-05-07T20:31:57.6856763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6857284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477956160>} 2025-05-07T20:31:57.6858057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6858262Z context = 2025-05-07T20:31:57.6858270Z 2025-05-07T20:31:57.6858437Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6858708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6858821Z module_map=module_map) 2025-05-07T20:31:57.6858989Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6859096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6859176Z E ^ 2025-05-07T20:31:57.6859541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6859545Z 2025-05-07T20:31:57.6859977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6860059Z 2025-05-07T20:31:57.6860165Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6860405Z self=, 2025-05-07T20:31:57.6860484Z T=16384, 2025-05-07T20:31:57.6860563Z D=5120, 2025-05-07T20:31:57.6860655Z scale_ub=1200.0, 2025-05-07T20:31:57.6860741Z contiguous=True, 2025-05-07T20:31:57.6860828Z compiled=False, 2025-05-07T20:31:57.6860908Z ) 2025-05-07T20:31:57.6861132Z self = 2025-05-07T20:31:57.6861316Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.6861320Z 2025-05-07T20:31:57.6861406Z @given( 2025-05-07T20:31:57.6861526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6861627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6861750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6861876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6862001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6862079Z ) 2025-05-07T20:31:57.6862337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6862440Z def test_silu_mul_quant( 2025-05-07T20:31:57.6862518Z self, 2025-05-07T20:31:57.6862596Z T: int, 2025-05-07T20:31:57.6862679Z D: int, 2025-05-07T20:31:57.6862777Z scale_ub: Optional[float], 2025-05-07T20:31:57.6862867Z contiguous: bool, 2025-05-07T20:31:57.6862960Z compiled: bool, 2025-05-07T20:31:57.6863040Z ) -> None: 2025-05-07T20:31:57.6863135Z torch.manual_seed(2025) 2025-05-07T20:31:57.6863217Z 2025-05-07T20:31:57.6863387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6863467Z 2025-05-07T20:31:57.6863560Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6863688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6863783Z x = x_sign * x_clamp 2025-05-07T20:31:57.6863865Z x0 = x[:, :D] 2025-05-07T20:31:57.6864028Z x1 = x[:, D:] 2025-05-07T20:31:57.6864112Z 2025-05-07T20:31:57.6864199Z if contiguous: 2025-05-07T20:31:57.6864292Z x0 = x0.contiguous() 2025-05-07T20:31:57.6864389Z x1 = x1.contiguous() 2025-05-07T20:31:57.6864462Z 2025-05-07T20:31:57.6864552Z if scale_ub is not None: 2025-05-07T20:31:57.6864666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6864802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6864883Z ) 2025-05-07T20:31:57.6864961Z else: 2025-05-07T20:31:57.6865055Z scale_ub_tensor = None 2025-05-07T20:31:57.6865134Z 2025-05-07T20:31:57.6865265Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6865362Z op = silu_mul_quant 2025-05-07T20:31:57.6865454Z if compiled: 2025-05-07T20:31:57.6865555Z op = torch.compile(op) 2025-05-07T20:31:57.6865664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6865749Z 2025-05-07T20:31:57.6865844Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6865849Z 2025-05-07T20:31:57.6865946Z moe/activation_test.py:117: 2025-05-07T20:31:57.6866084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6866186Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6866292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6866806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:57.6866908Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6867284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6867591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6867957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6868054Z kernel = self.compile( 2025-05-07T20:31:57.6868451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6868638Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6868771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6868777Z 2025-05-07T20:31:57.6868988Z self = 2025-05-07T20:31:57.6869796Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6870315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477847d80>} 2025-05-07T20:31:57.6871100Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6871295Z context = 2025-05-07T20:31:57.6871300Z 2025-05-07T20:31:57.6871473Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6871744Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6871851Z module_map=module_map) 2025-05-07T20:31:57.6872023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6872129Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6872208Z E ^ 2025-05-07T20:31:57.6872579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6872660Z 2025-05-07T20:31:57.6873090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6873095Z 2025-05-07T20:31:57.6873208Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6873437Z self=, 2025-05-07T20:31:57.6873516Z T=1, 2025-05-07T20:31:57.6873601Z D=7168, 2025-05-07T20:31:57.6873685Z scale_ub=1200.0, 2025-05-07T20:31:57.6873775Z contiguous=False, 2025-05-07T20:31:57.6873867Z compiled=False, 2025-05-07T20:31:57.6873941Z ) 2025-05-07T20:31:57.6874173Z self = 2025-05-07T20:31:57.6874350Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.6874354Z 2025-05-07T20:31:57.6874433Z @given( 2025-05-07T20:31:57.6874560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6874665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6874782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6874912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6875027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6875104Z ) 2025-05-07T20:31:57.6875364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6875459Z def test_silu_mul_quant( 2025-05-07T20:31:57.6875544Z self, 2025-05-07T20:31:57.6875623Z T: int, 2025-05-07T20:31:57.6875701Z D: int, 2025-05-07T20:31:57.6875806Z scale_ub: Optional[float], 2025-05-07T20:31:57.6875897Z contiguous: bool, 2025-05-07T20:31:57.6876067Z compiled: bool, 2025-05-07T20:31:57.6876157Z ) -> None: 2025-05-07T20:31:57.6876253Z torch.manual_seed(2025) 2025-05-07T20:31:57.6876328Z 2025-05-07T20:31:57.6876510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6876586Z 2025-05-07T20:31:57.6876678Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6876810Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6876899Z x = x_sign * x_clamp 2025-05-07T20:31:57.6876984Z x0 = x[:, :D] 2025-05-07T20:31:57.6877066Z x1 = x[:, D:] 2025-05-07T20:31:57.6877139Z 2025-05-07T20:31:57.6877229Z if contiguous: 2025-05-07T20:31:57.6877321Z x0 = x0.contiguous() 2025-05-07T20:31:57.6877412Z x1 = x1.contiguous() 2025-05-07T20:31:57.6877492Z 2025-05-07T20:31:57.6877584Z if scale_ub is not None: 2025-05-07T20:31:57.6877690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6877832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6877915Z ) 2025-05-07T20:31:57.6877992Z else: 2025-05-07T20:31:57.6878094Z scale_ub_tensor = None 2025-05-07T20:31:57.6878168Z 2025-05-07T20:31:57.6878303Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6878400Z op = silu_mul_quant 2025-05-07T20:31:57.6878486Z if compiled: 2025-05-07T20:31:57.6878590Z op = torch.compile(op) 2025-05-07T20:31:57.6878696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6878769Z 2025-05-07T20:31:57.6878867Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6878871Z 2025-05-07T20:31:57.6878968Z moe/activation_test.py:117: 2025-05-07T20:31:57.6879100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6879210Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6879310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6879828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6879934Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6880410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6880648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6880999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6881095Z kernel = self.compile( 2025-05-07T20:31:57.6881494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6881671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6881806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6881816Z 2025-05-07T20:31:57.6882025Z self = 2025-05-07T20:31:57.6882836Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6883359Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477e7fce0>} 2025-05-07T20:31:57.6884133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6884333Z context = 2025-05-07T20:31:57.6884337Z 2025-05-07T20:31:57.6884505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6884852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6884965Z module_map=module_map) 2025-05-07T20:31:57.6885135Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6885241Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6885319Z E ^ 2025-05-07T20:31:57.6885681Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6885686Z 2025-05-07T20:31:57.6886118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6886122Z 2025-05-07T20:31:57.6886227Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6886463Z self=, 2025-05-07T20:31:57.6886541Z T=4096, 2025-05-07T20:31:57.6886622Z D=7168, 2025-05-07T20:31:57.6886713Z scale_ub=1200.0, 2025-05-07T20:31:57.6886801Z contiguous=False, 2025-05-07T20:31:57.6886885Z compiled=True, 2025-05-07T20:31:57.6886979Z ) 2025-05-07T20:31:57.6887207Z self = 2025-05-07T20:31:57.6887392Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6887396Z 2025-05-07T20:31:57.6887482Z @given( 2025-05-07T20:31:57.6887603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6894566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6894708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6894837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6894952Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6895034Z ) 2025-05-07T20:31:57.6895294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6895399Z def test_silu_mul_quant( 2025-05-07T20:31:57.6895483Z self, 2025-05-07T20:31:57.6895561Z T: int, 2025-05-07T20:31:57.6895640Z D: int, 2025-05-07T20:31:57.6895864Z scale_ub: Optional[float], 2025-05-07T20:31:57.6895957Z contiguous: bool, 2025-05-07T20:31:57.6896043Z compiled: bool, 2025-05-07T20:31:57.6896133Z ) -> None: 2025-05-07T20:31:57.6896229Z torch.manual_seed(2025) 2025-05-07T20:31:57.6896304Z 2025-05-07T20:31:57.6896488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6896567Z 2025-05-07T20:31:57.6896669Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6896797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6896887Z x = x_sign * x_clamp 2025-05-07T20:31:57.6896977Z x0 = x[:, :D] 2025-05-07T20:31:57.6897057Z x1 = x[:, D:] 2025-05-07T20:31:57.6897130Z 2025-05-07T20:31:57.6897227Z if contiguous: 2025-05-07T20:31:57.6897320Z x0 = x0.contiguous() 2025-05-07T20:31:57.6897411Z x1 = x1.contiguous() 2025-05-07T20:31:57.6897495Z 2025-05-07T20:31:57.6897586Z if scale_ub is not None: 2025-05-07T20:31:57.6897698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6897846Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6897923Z ) 2025-05-07T20:31:57.6898007Z else: 2025-05-07T20:31:57.6898103Z scale_ub_tensor = None 2025-05-07T20:31:57.6898177Z 2025-05-07T20:31:57.6898321Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6898412Z op = silu_mul_quant 2025-05-07T20:31:57.6898498Z if compiled: 2025-05-07T20:31:57.6898606Z op = torch.compile(op) 2025-05-07T20:31:57.6898714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6898787Z 2025-05-07T20:31:57.6898882Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6898971Z 2025-05-07T20:31:57.6899072Z moe/activation_test.py:117: 2025-05-07T20:31:57.6899212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6899319Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6899420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6899813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6899907Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6900426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6900523Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6900893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6901129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6901483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6901579Z kernel = self.compile( 2025-05-07T20:31:57.6901987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6902193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6902355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6902360Z 2025-05-07T20:31:57.6902570Z self = 2025-05-07T20:31:57.6903378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6903904Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44c6ff5f80>} 2025-05-07T20:31:57.6904758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6904962Z context = 2025-05-07T20:31:57.6904967Z 2025-05-07T20:31:57.6905136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6905414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6905523Z module_map=module_map) 2025-05-07T20:31:57.6905688Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6905793Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6905870Z E ^ 2025-05-07T20:31:57.6906605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6906614Z 2025-05-07T20:31:57.6907110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6907116Z 2025-05-07T20:31:57.6907221Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6907459Z self=, 2025-05-07T20:31:57.6907536Z T=128, 2025-05-07T20:31:57.6907615Z D=7168, 2025-05-07T20:31:57.6907704Z scale_ub=1200.0, 2025-05-07T20:31:57.6907790Z contiguous=False, 2025-05-07T20:31:57.6907873Z compiled=True, 2025-05-07T20:31:57.6907955Z ) 2025-05-07T20:31:57.6908179Z self = 2025-05-07T20:31:57.6908355Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6908360Z 2025-05-07T20:31:57.6908679Z @given( 2025-05-07T20:31:57.6908799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6908904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6909019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6909142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6909262Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6909335Z ) 2025-05-07T20:31:57.6909586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6909684Z def test_silu_mul_quant( 2025-05-07T20:31:57.6909760Z self, 2025-05-07T20:31:57.6909839Z T: int, 2025-05-07T20:31:57.6909921Z D: int, 2025-05-07T20:31:57.6910018Z scale_ub: Optional[float], 2025-05-07T20:31:57.6910111Z contiguous: bool, 2025-05-07T20:31:57.6910196Z compiled: bool, 2025-05-07T20:31:57.6910273Z ) -> None: 2025-05-07T20:31:57.6910374Z torch.manual_seed(2025) 2025-05-07T20:31:57.6910451Z 2025-05-07T20:31:57.6910623Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6910704Z 2025-05-07T20:31:57.6910795Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6910928Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6911023Z x = x_sign * x_clamp 2025-05-07T20:31:57.6911102Z x0 = x[:, :D] 2025-05-07T20:31:57.6911181Z x1 = x[:, D:] 2025-05-07T20:31:57.6911262Z 2025-05-07T20:31:57.6911347Z if contiguous: 2025-05-07T20:31:57.6911438Z x0 = x0.contiguous() 2025-05-07T20:31:57.6911534Z x1 = x1.contiguous() 2025-05-07T20:31:57.6911607Z 2025-05-07T20:31:57.6911703Z if scale_ub is not None: 2025-05-07T20:31:57.6911806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6911942Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6912021Z ) 2025-05-07T20:31:57.6912103Z else: 2025-05-07T20:31:57.6912196Z scale_ub_tensor = None 2025-05-07T20:31:57.6912273Z 2025-05-07T20:31:57.6912403Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6912493Z op = silu_mul_quant 2025-05-07T20:31:57.6912717Z if compiled: 2025-05-07T20:31:57.6912819Z op = torch.compile(op) 2025-05-07T20:31:57.6912926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6913004Z 2025-05-07T20:31:57.6913094Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6913098Z 2025-05-07T20:31:57.6913200Z moe/activation_test.py:117: 2025-05-07T20:31:57.6913335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6913435Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6913540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6913917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6914016Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6914537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6914639Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6915012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6915239Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6915589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6915688Z kernel = self.compile( 2025-05-07T20:31:57.6916084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6916262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6916403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6916486Z 2025-05-07T20:31:57.6916695Z self = 2025-05-07T20:31:57.6917518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6918034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448572d9e0>} 2025-05-07T20:31:57.6918816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6919010Z context = 2025-05-07T20:31:57.6919020Z 2025-05-07T20:31:57.6919188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6919466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6919577Z module_map=module_map) 2025-05-07T20:31:57.6919747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6919847Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6919925Z E ^ 2025-05-07T20:31:57.6920296Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.6920839Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported); source listing and traceback are identical to the example above.
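Both failures above share one root cause: Triton refuses to lower the fp8e4nv (float8_e4m3fn) element type on this runner's GPU. The job ran on a linux.g5.4xlarge.nvidia.gpu instance, whose A10G reports compute capability 8.6, while Triton (depending on version) only emits fp8e4nv on capability 8.9 or newer (Ada/Hopper); on 8.6 it offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability guard that could skip these examples on unsupported hardware follows; the helper name, the threshold, and the skipIf placement are illustrative assumptions, not taken from activation_test.py:

import unittest

import torch


def triton_supports_fp8e4nv() -> bool:
    # Assumption: Triton can lower fp8e4nv (float8_e4m3fn) only on compute
    # capability >= (8, 9), i.e. Ada or Hopper. The A10G on a g5.4xlarge
    # reports (8, 6), which is why the kernel above fails to compile there.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test class:
@unittest.skipIf(not triton_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantFp8Tests(unittest.TestCase):
    ...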
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6933905Z 2025-05-07T20:31:57.6934342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6934346Z 2025-05-07T20:31:57.6934456Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6934685Z self=, 2025-05-07T20:31:57.6934770Z T=16384, 2025-05-07T20:31:57.6934847Z D=5120, 2025-05-07T20:31:57.6934928Z scale_ub=None, 2025-05-07T20:31:57.6935021Z contiguous=False, 2025-05-07T20:31:57.6935105Z compiled=False, 2025-05-07T20:31:57.6935178Z ) 2025-05-07T20:31:57.6935412Z self = 2025-05-07T20:31:57.6935594Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.6935605Z 2025-05-07T20:31:57.6935682Z @given( 2025-05-07T20:31:57.6935807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6935907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6936026Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6936150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6936267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6936347Z ) 2025-05-07T20:31:57.6936600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6936692Z def test_silu_mul_quant( 2025-05-07T20:31:57.6936776Z self, 2025-05-07T20:31:57.6936853Z T: int, 2025-05-07T20:31:57.6936930Z D: int, 2025-05-07T20:31:57.6937037Z scale_ub: Optional[float], 2025-05-07T20:31:57.6937127Z contiguous: bool, 2025-05-07T20:31:57.6937215Z compiled: bool, 2025-05-07T20:31:57.6937303Z ) -> None: 2025-05-07T20:31:57.6937398Z torch.manual_seed(2025) 2025-05-07T20:31:57.6937470Z 2025-05-07T20:31:57.6937649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6937722Z 2025-05-07T20:31:57.6937899Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6938025Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6939915Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6939932Z 2025-05-07T20:31:57.6940053Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.6940057Z 2025-05-07T20:31:57.6940159Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6940398Z self=, 2025-05-07T20:31:57.6940475Z T=4096, 2025-05-07T20:31:57.6940552Z D=7168, 2025-05-07T20:31:57.6940640Z scale_ub=1200.0, 2025-05-07T20:31:57.6940725Z contiguous=True, 2025-05-07T20:31:57.6940809Z compiled=True, 2025-05-07T20:31:57.6940887Z ) 2025-05-07T20:31:57.6941111Z self = 2025-05-07T20:31:57.6941291Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6941295Z 2025-05-07T20:31:57.6941372Z @given( 2025-05-07T20:31:57.6941491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6941593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6941706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6941934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6942055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6942127Z ) 2025-05-07T20:31:57.6942404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6942509Z def test_silu_mul_quant( 2025-05-07T20:31:57.6942598Z self, 2025-05-07T20:31:57.6942690Z T: int, 2025-05-07T20:31:57.6942767Z D: int, 2025-05-07T20:31:57.6942864Z scale_ub: Optional[float], 2025-05-07T20:31:57.6942960Z contiguous: bool, 2025-05-07T20:31:57.6943045Z compiled: bool, 2025-05-07T20:31:57.6943122Z ) -> None: 2025-05-07T20:31:57.6943223Z torch.manual_seed(2025) 2025-05-07T20:31:57.6943297Z 2025-05-07T20:31:57.6943465Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6943545Z 2025-05-07T20:31:57.6943635Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6943767Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6945639Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6945646Z 2025-05-07T20:31:57.6945770Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.6945774Z 2025-05-07T20:31:57.6945876Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6946103Z self=, 2025-05-07T20:31:57.6946192Z T=16384, 2025-05-07T20:31:57.6946267Z D=7168, 2025-05-07T20:31:57.6946349Z scale_ub=None, 2025-05-07T20:31:57.6946439Z contiguous=False, 2025-05-07T20:31:57.6946521Z compiled=False, 2025-05-07T20:31:57.6946595Z ) 2025-05-07T20:31:57.6946902Z self = 2025-05-07T20:31:57.6947083Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.6947087Z 2025-05-07T20:31:57.6947169Z @given( 2025-05-07T20:31:57.6947289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6947386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6947507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6947623Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6947737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6947816Z ) 2025-05-07T20:31:57.6948068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6948166Z def test_silu_mul_quant( 2025-05-07T20:31:57.6948248Z self, 2025-05-07T20:31:57.6948326Z T: int, 2025-05-07T20:31:57.6948410Z D: int, 2025-05-07T20:31:57.6948513Z scale_ub: Optional[float], 2025-05-07T20:31:57.6948603Z contiguous: bool, 2025-05-07T20:31:57.6948695Z compiled: bool, 2025-05-07T20:31:57.6948774Z ) -> None: 2025-05-07T20:31:57.6948869Z torch.manual_seed(2025) 2025-05-07T20:31:57.6948948Z 2025-05-07T20:31:57.6949115Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6950980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6951075Z 2025-05-07T20:31:57.6951194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.6951199Z 2025-05-07T20:31:57.6951301Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6951533Z self=, 2025-05-07T20:31:57.6951610Z T=2048, 2025-05-07T20:31:57.6951693Z D=7168, 2025-05-07T20:31:57.6951778Z scale_ub=1200.0, 2025-05-07T20:31:57.6951866Z contiguous=True, 2025-05-07T20:31:57.6951955Z compiled=True, 2025-05-07T20:31:57.6952028Z ) 2025-05-07T20:31:57.6952251Z self = 2025-05-07T20:31:57.6952432Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6952441Z 2025-05-07T20:31:57.6952520Z @given( 2025-05-07T20:31:57.6952639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6952746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6952864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6952989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6953103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6953179Z ) 2025-05-07T20:31:57.6953440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6953534Z def test_silu_mul_quant( 2025-05-07T20:31:57.6953610Z self, 2025-05-07T20:31:57.6953695Z T: int, 2025-05-07T20:31:57.6953770Z D: int, 2025-05-07T20:31:57.6953867Z scale_ub: Optional[float], 2025-05-07T20:31:57.6953966Z contiguous: bool, 2025-05-07T20:31:57.6954051Z compiled: bool, 2025-05-07T20:31:57.6954128Z ) -> None: 2025-05-07T20:31:57.6954232Z torch.manual_seed(2025) 2025-05-07T20:31:57.6954304Z 2025-05-07T20:31:57.6954471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6954550Z 2025-05-07T20:31:57.6954755Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6954888Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6956736Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6956742Z 2025-05-07T20:31:57.6956871Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.6956876Z 2025-05-07T20:31:57.6956995Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6957228Z self=, 2025-05-07T20:31:57.6957312Z T=2048, 2025-05-07T20:31:57.6957390Z D=7168, 2025-05-07T20:31:57.6957481Z scale_ub=None, 2025-05-07T20:31:57.6957567Z contiguous=True, 2025-05-07T20:31:57.6957651Z compiled=False, 2025-05-07T20:31:57.6957733Z ) 2025-05-07T20:31:57.6957954Z self = 2025-05-07T20:31:57.6958131Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6958141Z 2025-05-07T20:31:57.6958218Z @given( 2025-05-07T20:31:57.6958338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6958444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6958557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6958762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6958883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6958957Z ) 2025-05-07T20:31:57.6959213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6959314Z def test_silu_mul_quant( 2025-05-07T20:31:57.6959390Z self, 2025-05-07T20:31:57.6959467Z T: int, 2025-05-07T20:31:57.6959550Z D: int, 2025-05-07T20:31:57.6959647Z scale_ub: Optional[float], 2025-05-07T20:31:57.6959743Z contiguous: bool, 2025-05-07T20:31:57.6959829Z compiled: bool, 2025-05-07T20:31:57.6959906Z ) -> None: 2025-05-07T20:31:57.6960008Z torch.manual_seed(2025) 2025-05-07T20:31:57.6960082Z 2025-05-07T20:31:57.6960249Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6960332Z 2025-05-07T20:31:57.6960425Z > x_sign = torch.sign(x) 2025-05-07T20:31:57.6962291Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6962297Z 2025-05-07T20:31:57.6962416Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:57.6962421Z 2025-05-07T20:31:57.6962523Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6962756Z self=, 2025-05-07T20:31:57.6962832Z T=1, 2025-05-07T20:31:57.6962917Z D=7168, 2025-05-07T20:31:57.6963002Z scale_ub=1200.0, 2025-05-07T20:31:57.6963093Z contiguous=True, 2025-05-07T20:31:57.6963184Z compiled=False, 2025-05-07T20:31:57.6963255Z ) 2025-05-07T20:31:57.6963477Z self = 2025-05-07T20:31:57.6963736Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.6963741Z 2025-05-07T20:31:57.6963819Z @given( 2025-05-07T20:31:57.6963938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6964042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6964157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6964278Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6964392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6964467Z ) 2025-05-07T20:31:57.6964722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6964814Z def test_silu_mul_quant( 2025-05-07T20:31:57.6964895Z self, 2025-05-07T20:31:57.6964977Z T: int, 2025-05-07T20:31:57.6965051Z D: int, 2025-05-07T20:31:57.6965147Z scale_ub: Optional[float], 2025-05-07T20:31:57.6965242Z contiguous: bool, 2025-05-07T20:31:57.6965332Z compiled: bool, 2025-05-07T20:31:57.6965409Z ) -> None: 2025-05-07T20:31:57.6965510Z torch.manual_seed(2025) 2025-05-07T20:31:57.6965586Z 2025-05-07T20:31:57.6965760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6965833Z 2025-05-07T20:31:57.6965924Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6966057Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6966146Z x = x_sign * x_clamp 2025-05-07T20:31:57.6966226Z x0 = x[:, :D] 2025-05-07T20:31:57.6966313Z x1 = x[:, D:] 2025-05-07T20:31:57.6966385Z 2025-05-07T20:31:57.6966470Z if contiguous: 2025-05-07T20:31:57.6966567Z x0 = x0.contiguous() 2025-05-07T20:31:57.6966740Z x1 = x1.contiguous() 2025-05-07T20:31:57.6966812Z 2025-05-07T20:31:57.6966909Z if scale_ub is not None: 2025-05-07T20:31:57.6967014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6967155Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6967237Z ) 2025-05-07T20:31:57.6967314Z else: 2025-05-07T20:31:57.6967414Z scale_ub_tensor = None 2025-05-07T20:31:57.6967489Z 2025-05-07T20:31:57.6967621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6967718Z op = silu_mul_quant 2025-05-07T20:31:57.6967803Z if compiled: 2025-05-07T20:31:57.6967902Z op = torch.compile(op) 2025-05-07T20:31:57.6968017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6968090Z 2025-05-07T20:31:57.6968181Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6968185Z 2025-05-07T20:31:57.6968290Z moe/activation_test.py:117: 2025-05-07T20:31:57.6968427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6968537Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6968638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6969161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6969265Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6969635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6969862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6970222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6970317Z kernel = self.compile( 2025-05-07T20:31:57.6970720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6970903Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6971034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6971121Z 2025-05-07T20:31:57.6971338Z self = 2025-05-07T20:31:57.6972293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6972819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474e5cea0>} 2025-05-07T20:31:57.6973596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6973793Z context = 2025-05-07T20:31:57.6973806Z 2025-05-07T20:31:57.6973981Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6974252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6974368Z module_map=module_map) 2025-05-07T20:31:57.6974533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6974632Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6974715Z E ^ 2025-05-07T20:31:57.6975082Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6975087Z 2025-05-07T20:31:57.6975520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6975605Z 2025-05-07T20:31:57.6975710Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6975937Z self=, 2025-05-07T20:31:57.6976022Z T=128, 2025-05-07T20:31:57.6976105Z D=5120, 2025-05-07T20:31:57.6976188Z scale_ub=None, 2025-05-07T20:31:57.6976280Z contiguous=True, 2025-05-07T20:31:57.6976365Z compiled=False, 2025-05-07T20:31:57.6976440Z ) 2025-05-07T20:31:57.6976671Z self = 2025-05-07T20:31:57.6976846Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6976851Z 2025-05-07T20:31:57.6976934Z @given( 2025-05-07T20:31:57.6977053Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6977154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6977275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6977393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6977512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6977592Z ) 2025-05-07T20:31:57.6977849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6977948Z def test_silu_mul_quant( 2025-05-07T20:31:57.6978023Z self, 2025-05-07T20:31:57.6978099Z T: int, 2025-05-07T20:31:57.6978181Z D: int, 2025-05-07T20:31:57.6978279Z scale_ub: Optional[float], 2025-05-07T20:31:57.6978369Z contiguous: bool, 2025-05-07T20:31:57.6978460Z compiled: bool, 2025-05-07T20:31:57.6978537Z ) -> None: 2025-05-07T20:31:57.6978630Z torch.manual_seed(2025) 2025-05-07T20:31:57.6978712Z 2025-05-07T20:31:57.6978883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6978957Z 2025-05-07T20:31:57.6979055Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6979179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6979273Z x = x_sign * x_clamp 2025-05-07T20:31:57.6979359Z x0 = x[:, :D] 2025-05-07T20:31:57.6979438Z x1 = x[:, D:] 2025-05-07T20:31:57.6979515Z 2025-05-07T20:31:57.6979681Z if contiguous: 2025-05-07T20:31:57.6979776Z x0 = x0.contiguous() 2025-05-07T20:31:57.6979874Z x1 = x1.contiguous() 2025-05-07T20:31:57.6979946Z 2025-05-07T20:31:57.6980035Z if scale_ub is not None: 2025-05-07T20:31:57.6980144Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6980281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6980355Z ) 2025-05-07T20:31:57.6980438Z else: 2025-05-07T20:31:57.6980532Z scale_ub_tensor = None 2025-05-07T20:31:57.6980604Z 2025-05-07T20:31:57.6980741Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6980832Z op = silu_mul_quant 2025-05-07T20:31:57.6980921Z if compiled: 2025-05-07T20:31:57.6981027Z op = torch.compile(op) 2025-05-07T20:31:57.6981131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6981210Z 2025-05-07T20:31:57.6981299Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6981311Z 2025-05-07T20:31:57.6981409Z moe/activation_test.py:117: 2025-05-07T20:31:57.6981544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6981644Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6981745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6982299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6982411Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6982785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6983011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6983443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6983542Z kernel = self.compile( 2025-05-07T20:31:57.6983942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6984120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6984258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6984263Z 2025-05-07T20:31:57.6984470Z self = 2025-05-07T20:31:57.6985281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6985806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474e5df80>} 2025-05-07T20:31:57.6986591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6986783Z context = 2025-05-07T20:31:57.6986788Z 2025-05-07T20:31:57.6986957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6987236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6987342Z module_map=module_map) 2025-05-07T20:31:57.6987516Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6987616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6987698Z E ^ 2025-05-07T20:31:57.6988070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6988075Z 2025-05-07T20:31:57.6988604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6988609Z 2025-05-07T20:31:57.6988722Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6988951Z self=, 2025-05-07T20:31:57.6989029Z T=128, 2025-05-07T20:31:57.6989110Z D=7168, 2025-05-07T20:31:57.6989193Z scale_ub=None, 2025-05-07T20:31:57.6989279Z contiguous=True, 2025-05-07T20:31:57.6989370Z compiled=False, 2025-05-07T20:31:57.6989445Z ) 2025-05-07T20:31:57.6989669Z self = 2025-05-07T20:31:57.6989849Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6989858Z 2025-05-07T20:31:57.6989935Z @given( 2025-05-07T20:31:57.6990054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6990160Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6990284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6990407Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6990523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6990597Z ) 2025-05-07T20:31:57.6990857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6990949Z def test_silu_mul_quant( 2025-05-07T20:31:57.6991025Z self, 2025-05-07T20:31:57.6991108Z T: int, 2025-05-07T20:31:57.6991184Z D: int, 2025-05-07T20:31:57.6991281Z scale_ub: Optional[float], 2025-05-07T20:31:57.6991380Z contiguous: bool, 2025-05-07T20:31:57.6991480Z compiled: bool, 2025-05-07T20:31:57.6991576Z ) -> None: 2025-05-07T20:31:57.6991775Z torch.manual_seed(2025) 2025-05-07T20:31:57.6991847Z 2025-05-07T20:31:57.6992028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6992102Z 2025-05-07T20:31:57.6992198Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6992329Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6992416Z x = x_sign * x_clamp 2025-05-07T20:31:57.6992496Z x0 = x[:, :D] 2025-05-07T20:31:57.6992582Z x1 = x[:, D:] 2025-05-07T20:31:57.6992654Z 2025-05-07T20:31:57.6992739Z if contiguous: 2025-05-07T20:31:57.6992836Z x0 = x0.contiguous() 2025-05-07T20:31:57.6992924Z x1 = x1.contiguous() 2025-05-07T20:31:57.6992994Z 2025-05-07T20:31:57.6993089Z if scale_ub is not None: 2025-05-07T20:31:57.6993193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6993335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6993418Z ) 2025-05-07T20:31:57.6993494Z else: 2025-05-07T20:31:57.6993594Z scale_ub_tensor = None 2025-05-07T20:31:57.6993665Z 2025-05-07T20:31:57.6993794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6993894Z op = silu_mul_quant 2025-05-07T20:31:57.6993979Z if compiled: 2025-05-07T20:31:57.6994077Z op = torch.compile(op) 2025-05-07T20:31:57.6994188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6994260Z 2025-05-07T20:31:57.6994351Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6994362Z 2025-05-07T20:31:57.6994459Z moe/activation_test.py:117: 2025-05-07T20:31:57.6994592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6994697Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6994797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6995313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6995421Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6995876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6996105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6996466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6996560Z kernel = self.compile( 2025-05-07T20:31:57.6996961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6997138Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6997271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6997275Z 2025-05-07T20:31:57.6997488Z self = 2025-05-07T20:31:57.6998300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6998823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474e5ee80>} 2025-05-07T20:31:57.6999596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6999793Z context = 2025-05-07T20:31:57.6999798Z 2025-05-07T20:31:57.6999966Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.7000238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7000427Z module_map=module_map) 2025-05-07T20:31:57.7000592Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7000694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7000780Z E ^ 2025-05-07T20:31:57.7001145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7001149Z 2025-05-07T20:31:57.7001586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7001590Z 2025-05-07T20:31:57.7001694Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7001923Z self=, 2025-05-07T20:31:57.7002006Z T=2048, 2025-05-07T20:31:57.7002087Z D=7168, 2025-05-07T20:31:57.7002170Z scale_ub=1200.0, 2025-05-07T20:31:57.7002265Z contiguous=True, 2025-05-07T20:31:57.7002350Z compiled=False, 2025-05-07T20:31:57.7002430Z ) 2025-05-07T20:31:57.7002654Z self = 2025-05-07T20:31:57.7002839Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7002843Z 2025-05-07T20:31:57.7002931Z @given( 2025-05-07T20:31:57.7003051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7003152Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7003275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7003392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7003508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7003591Z ) 2025-05-07T20:31:57.7003843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7003946Z def test_silu_mul_quant( 2025-05-07T20:31:57.7004027Z self, 2025-05-07T20:31:57.7004105Z T: int, 2025-05-07T20:31:57.7004190Z D: int, 2025-05-07T20:31:57.7004288Z scale_ub: Optional[float], 2025-05-07T20:31:57.7004379Z contiguous: bool, 2025-05-07T20:31:57.7004558Z compiled: bool, 2025-05-07T20:31:57.7004638Z ) -> None: 2025-05-07T20:31:57.7004734Z torch.manual_seed(2025) 2025-05-07T20:31:57.7004813Z 2025-05-07T20:31:57.7004982Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7007229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7007250Z 2025-05-07T20:31:57.7007378Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7007383Z 2025-05-07T20:31:57.7007501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7007729Z self=, 2025-05-07T20:31:57.7007806Z T=1, 2025-05-07T20:31:57.7007890Z D=5120, 2025-05-07T20:31:57.7007974Z scale_ub=1200.0, 2025-05-07T20:31:57.7008059Z contiguous=True, 2025-05-07T20:31:57.7008148Z compiled=False, 2025-05-07T20:31:57.7008222Z ) 2025-05-07T20:31:57.7008444Z self = 2025-05-07T20:31:57.7008623Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7008627Z 2025-05-07T20:31:57.7008703Z @given( 2025-05-07T20:31:57.7008831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7009149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7009265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7009387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7009505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7009581Z ) 2025-05-07T20:31:57.7009838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7009930Z def test_silu_mul_quant( 2025-05-07T20:31:57.7010005Z self, 2025-05-07T20:31:57.7010087Z T: int, 2025-05-07T20:31:57.7010162Z D: int, 2025-05-07T20:31:57.7010260Z scale_ub: Optional[float], 2025-05-07T20:31:57.7010354Z contiguous: bool, 2025-05-07T20:31:57.7010439Z compiled: bool, 2025-05-07T20:31:57.7010523Z ) -> None: 2025-05-07T20:31:57.7010618Z torch.manual_seed(2025) 2025-05-07T20:31:57.7010691Z 2025-05-07T20:31:57.7010866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7010946Z 2025-05-07T20:31:57.7011038Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7011167Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7011260Z x = x_sign * x_clamp 2025-05-07T20:31:57.7011337Z x0 = x[:, :D] 2025-05-07T20:31:57.7011425Z x1 = x[:, D:] 2025-05-07T20:31:57.7011496Z 2025-05-07T20:31:57.7011579Z if contiguous: 2025-05-07T20:31:57.7011678Z x0 = x0.contiguous() 2025-05-07T20:31:57.7011843Z x1 = x1.contiguous() 2025-05-07T20:31:57.7011917Z 2025-05-07T20:31:57.7012017Z if scale_ub is not None: 2025-05-07T20:31:57.7012146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.7012308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.7012388Z ) 2025-05-07T20:31:57.7012469Z else: 2025-05-07T20:31:57.7012572Z scale_ub_tensor = None 2025-05-07T20:31:57.7012650Z 2025-05-07T20:31:57.7012780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.7012875Z op = silu_mul_quant 2025-05-07T20:31:57.7012959Z if compiled: 2025-05-07T20:31:57.7013195Z op = torch.compile(op) 2025-05-07T20:31:57.7013309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7013382Z 2025-05-07T20:31:57.7013471Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.7013482Z 2025-05-07T20:31:57.7013580Z moe/activation_test.py:117: 2025-05-07T20:31:57.7013712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7013820Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.7013919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7014440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.7014542Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.7014919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.7015154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.7015512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.7015606Z kernel = self.compile( 2025-05-07T20:31:57.7016008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.7016186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.7016315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7016320Z 2025-05-07T20:31:57.7016536Z self = 2025-05-07T20:31:57.7017345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.7018000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474a1c400>} 2025-05-07T20:31:57.7018773Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.7018973Z context = 2025-05-07T20:31:57.7018977Z 2025-05-07T20:31:57.7019147Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.7019420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7026040Z module_map=module_map) 2025-05-07T20:31:57.7026246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7026349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7026427Z E ^ 2025-05-07T20:31:57.7026809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7026815Z 2025-05-07T20:31:57.7027252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7027256Z 2025-05-07T20:31:57.7027371Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7027600Z self=, 2025-05-07T20:31:57.7027678Z T=2048, 2025-05-07T20:31:57.7027766Z D=5120, 2025-05-07T20:31:57.7027849Z scale_ub=None, 2025-05-07T20:31:57.7027935Z contiguous=True, 2025-05-07T20:31:57.7028029Z compiled=False, 2025-05-07T20:31:57.7028106Z ) 2025-05-07T20:31:57.7028331Z self = 2025-05-07T20:31:57.7028521Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7028525Z 2025-05-07T20:31:57.7028720Z @given( 2025-05-07T20:31:57.7028860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7028962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7029080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7029210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7029326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7029402Z ) 2025-05-07T20:31:57.7029666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7029761Z def test_silu_mul_quant( 2025-05-07T20:31:57.7029839Z self, 2025-05-07T20:31:57.7029925Z T: int, 2025-05-07T20:31:57.7030002Z D: int, 2025-05-07T20:31:57.7030114Z scale_ub: Optional[float], 2025-05-07T20:31:57.7030204Z contiguous: bool, 2025-05-07T20:31:57.7030290Z compiled: bool, 2025-05-07T20:31:57.7030380Z ) -> None: 2025-05-07T20:31:57.7030482Z torch.manual_seed(2025) 2025-05-07T20:31:57.7030559Z 2025-05-07T20:31:57.7030739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7030817Z 2025-05-07T20:31:57.7030914Z > x_sign = torch.sign(x) 2025-05-07T20:31:57.7032786Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7032877Z 2025-05-07T20:31:57.7033001Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:57.7033005Z 2025-05-07T20:31:57.7033121Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7033348Z self=, 2025-05-07T20:31:57.7033435Z T=16384, 2025-05-07T20:31:57.7033511Z D=5120, 2025-05-07T20:31:57.7033594Z scale_ub=None, 2025-05-07T20:31:57.7033687Z contiguous=True, 2025-05-07T20:31:57.7033771Z compiled=False, 2025-05-07T20:31:57.7033843Z ) 2025-05-07T20:31:57.7034073Z self = 2025-05-07T20:31:57.7034253Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7034257Z 2025-05-07T20:31:57.7034342Z @given( 2025-05-07T20:31:57.7034462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7034567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7034689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7034808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7034928Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7035010Z ) 2025-05-07T20:31:57.7035262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7035363Z def test_silu_mul_quant( 2025-05-07T20:31:57.7035441Z self, 2025-05-07T20:31:57.7035518Z T: int, 2025-05-07T20:31:57.7035602Z D: int, 2025-05-07T20:31:57.7035700Z scale_ub: Optional[float], 2025-05-07T20:31:57.7035790Z contiguous: bool, 2025-05-07T20:31:57.7035884Z compiled: bool, 2025-05-07T20:31:57.7035964Z ) -> None: 2025-05-07T20:31:57.7036059Z torch.manual_seed(2025) 2025-05-07T20:31:57.7036139Z 2025-05-07T20:31:57.7036310Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7038261Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7038268Z 2025-05-07T20:31:57.7038388Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7038393Z 2025-05-07T20:31:57.7038496Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7038730Z self=, 2025-05-07T20:31:57.7038808Z T=4096, 2025-05-07T20:31:57.7038890Z D=5120, 2025-05-07T20:31:57.7038978Z scale_ub=None, 2025-05-07T20:31:57.7039061Z contiguous=True, 2025-05-07T20:31:57.7039152Z compiled=False, 2025-05-07T20:31:57.7039225Z ) 2025-05-07T20:31:57.7039450Z self = 2025-05-07T20:31:57.7039632Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7039637Z 2025-05-07T20:31:57.7039713Z @given( 2025-05-07T20:31:57.7039832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7039936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7040049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7040171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7040285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7040357Z ) 2025-05-07T20:31:57.7040618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7040711Z def test_silu_mul_quant( 2025-05-07T20:31:57.7040865Z self, 2025-05-07T20:31:57.7040947Z T: int, 2025-05-07T20:31:57.7041022Z D: int, 2025-05-07T20:31:57.7041118Z scale_ub: Optional[float], 2025-05-07T20:31:57.7041217Z contiguous: bool, 2025-05-07T20:31:57.7041302Z compiled: bool, 2025-05-07T20:31:57.7041380Z ) -> None: 2025-05-07T20:31:57.7041480Z torch.manual_seed(2025) 2025-05-07T20:31:57.7041551Z 2025-05-07T20:31:57.7041727Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7043929Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7043945Z 2025-05-07T20:31:57.7044072Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7044082Z 2025-05-07T20:31:57.7044183Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7044410Z self=, 2025-05-07T20:31:57.7044497Z T=2048, 2025-05-07T20:31:57.7044573Z D=5120, 2025-05-07T20:31:57.7044656Z scale_ub=None, 2025-05-07T20:31:57.7044748Z contiguous=False, 2025-05-07T20:31:57.7044831Z compiled=False, 2025-05-07T20:31:57.7044904Z ) 2025-05-07T20:31:57.7045130Z self = 2025-05-07T20:31:57.7045307Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.7045312Z 2025-05-07T20:31:57.7045395Z @given( 2025-05-07T20:31:57.7045517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7045617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7045736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7045937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7046054Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7046133Z ) 2025-05-07T20:31:57.7046385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7046485Z def test_silu_mul_quant( 2025-05-07T20:31:57.7046561Z self, 2025-05-07T20:31:57.7046638Z T: int, 2025-05-07T20:31:57.7046720Z D: int, 2025-05-07T20:31:57.7046816Z scale_ub: Optional[float], 2025-05-07T20:31:57.7046903Z contiguous: bool, 2025-05-07T20:31:57.7046995Z compiled: bool, 2025-05-07T20:31:57.7047072Z ) -> None: 2025-05-07T20:31:57.7047166Z torch.manual_seed(2025) 2025-05-07T20:31:57.7047243Z 2025-05-07T20:31:57.7047416Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7049274Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7049280Z 2025-05-07T20:31:57.7049398Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7049403Z 2025-05-07T20:31:57.7049504Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7049736Z self=, 2025-05-07T20:31:57.7049890Z T=4096, 2025-05-07T20:31:57.7049974Z D=7168, 2025-05-07T20:31:57.7050058Z scale_ub=None, 2025-05-07T20:31:57.7050144Z contiguous=True, 2025-05-07T20:31:57.7050233Z compiled=True, 2025-05-07T20:31:57.7050307Z ) 2025-05-07T20:31:57.7050535Z self = 2025-05-07T20:31:57.7050712Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:57.7050716Z 2025-05-07T20:31:57.7050794Z @given( 2025-05-07T20:31:57.7050914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7051019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7051133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7051255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7051370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7051444Z ) 2025-05-07T20:31:57.7051703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7051880Z def test_silu_mul_quant( 2025-05-07T20:31:57.7051957Z self, 2025-05-07T20:31:57.7052040Z T: int, 2025-05-07T20:31:57.7052116Z D: int, 2025-05-07T20:31:57.7052219Z scale_ub: Optional[float], 2025-05-07T20:31:57.7052318Z contiguous: bool, 2025-05-07T20:31:57.7052422Z compiled: bool, 2025-05-07T20:31:57.7052507Z ) -> None: 2025-05-07T20:31:57.7052632Z torch.manual_seed(2025) 2025-05-07T20:31:57.7052707Z 2025-05-07T20:31:57.7052882Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7054742Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7054752Z 2025-05-07T20:31:57.7054958Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7054963Z 2025-05-07T20:31:57.7055066Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7055296Z self=, 2025-05-07T20:31:57.7055385Z T=2048, 2025-05-07T20:31:57.7055462Z D=5120, 2025-05-07T20:31:57.7055545Z scale_ub=1200.0, 2025-05-07T20:31:57.7055638Z contiguous=False, 2025-05-07T20:31:57.7055723Z compiled=False, 2025-05-07T20:31:57.7055796Z ) 2025-05-07T20:31:57.7056022Z self = 2025-05-07T20:31:57.7056201Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.7056209Z 2025-05-07T20:31:57.7056290Z @given( 2025-05-07T20:31:57.7056409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7056507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7056631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7056749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7056861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7056940Z ) 2025-05-07T20:31:57.7057195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7057295Z def test_silu_mul_quant( 2025-05-07T20:31:57.7057371Z self, 2025-05-07T20:31:57.7057448Z T: int, 2025-05-07T20:31:57.7057528Z D: int, 2025-05-07T20:31:57.7057625Z scale_ub: Optional[float], 2025-05-07T20:31:57.7057714Z contiguous: bool, 2025-05-07T20:31:57.7057805Z compiled: bool, 2025-05-07T20:31:57.7057881Z ) -> None: 2025-05-07T20:31:57.7058081Z torch.manual_seed(2025) 2025-05-07T20:31:57.7058159Z 2025-05-07T20:31:57.7058328Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7060196Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7060202Z 2025-05-07T20:31:57.7060320Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7060325Z 2025-05-07T20:31:57.7060426Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7060664Z self=, 2025-05-07T20:31:57.7060746Z T=4096, 2025-05-07T20:31:57.7060830Z D=7168, 2025-05-07T20:31:57.7060913Z scale_ub=1200.0, 2025-05-07T20:31:57.7060999Z contiguous=True, 2025-05-07T20:31:57.7061096Z compiled=False, 2025-05-07T20:31:57.7061168Z ) 2025-05-07T20:31:57.7061390Z self = 2025-05-07T20:31:57.7061574Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7061579Z 2025-05-07T20:31:57.7061656Z @given( 2025-05-07T20:31:57.7061773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7061877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7061991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7062113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7062227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7062306Z ) 2025-05-07T20:31:57.7062564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7062658Z def test_silu_mul_quant( 2025-05-07T20:31:57.7062735Z self, 2025-05-07T20:31:57.7062901Z T: int, 2025-05-07T20:31:57.7062981Z D: int, 2025-05-07T20:31:57.7063077Z scale_ub: Optional[float], 2025-05-07T20:31:57.7063170Z contiguous: bool, 2025-05-07T20:31:57.7063257Z compiled: bool, 2025-05-07T20:31:57.7063335Z ) -> None: 2025-05-07T20:31:57.7063436Z torch.manual_seed(2025) 2025-05-07T20:31:57.7063510Z 2025-05-07T20:31:57.7063685Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7065542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7065553Z 2025-05-07T20:31:57.7065679Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7065683Z 2025-05-07T20:31:57.7065785Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7066011Z self=, 2025-05-07T20:31:57.7066099Z T=16384, 2025-05-07T20:31:57.7066176Z D=7168, 2025-05-07T20:31:57.7066259Z scale_ub=None, 2025-05-07T20:31:57.7066355Z contiguous=False, 2025-05-07T20:31:57.7066437Z compiled=True, 2025-05-07T20:31:57.7066509Z ) 2025-05-07T20:31:57.7066737Z self = 2025-05-07T20:31:57.7066915Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.7066998Z 2025-05-07T20:31:57.7067082Z @given( 2025-05-07T20:31:57.7067205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7067309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7067429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7067547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7067661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7067741Z ) 2025-05-07T20:31:57.7067991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7068090Z def test_silu_mul_quant( 2025-05-07T20:31:57.7068168Z self, 2025-05-07T20:31:57.7068245Z T: int, 2025-05-07T20:31:57.7068326Z D: int, 2025-05-07T20:31:57.7068424Z scale_ub: Optional[float], 2025-05-07T20:31:57.7068511Z contiguous: bool, 2025-05-07T20:31:57.7068602Z compiled: bool, 2025-05-07T20:31:57.7068685Z ) -> None: 2025-05-07T20:31:57.7068779Z torch.manual_seed(2025) 2025-05-07T20:31:57.7068858Z 2025-05-07T20:31:57.7069026Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7070888Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
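[Note] Every one of these messages ends with the allocator's standard suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That knob only helps when a large amount of memory is reserved by the caching allocator but unallocated (fragmentation); here only about 13.87 MiB is in that state, so it would likely not rescue these examples. For completeness, a minimal sketch of how the setting is applied, assuming it is put in place before CUDA is initialized:

    import os
    # Must be in the environment before the first CUDA allocation, e.g. via
    #   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    # in the job's shell, or at the very top of the test entry point:
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var is set
    x = torch.empty(1024, device="cuda")  # the allocator picks up the setting lazily here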
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7070894Z 2025-05-07T20:31:57.7071013Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7071017Z 2025-05-07T20:31:57.7071124Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7071358Z self=, 2025-05-07T20:31:57.7071434Z T=4096, 2025-05-07T20:31:57.7071517Z D=7168, 2025-05-07T20:31:57.7071679Z scale_ub=None, 2025-05-07T20:31:57.7071766Z contiguous=True, 2025-05-07T20:31:57.7071856Z compiled=False, 2025-05-07T20:31:57.7071928Z ) 2025-05-07T20:31:57.7072172Z self = 2025-05-07T20:31:57.7072377Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7072382Z 2025-05-07T20:31:57.7072459Z @given( 2025-05-07T20:31:57.7072576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7072680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7072793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7072917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7073034Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7073108Z ) 2025-05-07T20:31:57.7073364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7073463Z def test_silu_mul_quant( 2025-05-07T20:31:57.7073539Z self, 2025-05-07T20:31:57.7073625Z T: int, 2025-05-07T20:31:57.7073699Z D: int, 2025-05-07T20:31:57.7073795Z scale_ub: Optional[float], 2025-05-07T20:31:57.7073887Z contiguous: bool, 2025-05-07T20:31:57.7073972Z compiled: bool, 2025-05-07T20:31:57.7074049Z ) -> None: 2025-05-07T20:31:57.7074149Z torch.manual_seed(2025) 2025-05-07T20:31:57.7074225Z 2025-05-07T20:31:57.7074399Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7076256Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7076341Z 2025-05-07T20:31:57.7076465Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7076469Z 2025-05-07T20:31:57.7076570Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7076798Z self=, 2025-05-07T20:31:57.7076882Z T=16384, 2025-05-07T20:31:57.7076959Z D=7168, 2025-05-07T20:31:57.7077039Z scale_ub=None, 2025-05-07T20:31:57.7077129Z contiguous=True, 2025-05-07T20:31:57.7077213Z compiled=False, 2025-05-07T20:31:57.7077288Z ) 2025-05-07T20:31:57.7077516Z self = 2025-05-07T20:31:57.7077701Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7077705Z 2025-05-07T20:31:57.7077790Z @given( 2025-05-07T20:31:57.7077913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7078011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7078130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7078246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7078359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7078439Z ) 2025-05-07T20:31:57.7078688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7078786Z def test_silu_mul_quant( 2025-05-07T20:31:57.7078862Z self, 2025-05-07T20:31:57.7078939Z T: int, 2025-05-07T20:31:57.7079021Z D: int, 2025-05-07T20:31:57.7079117Z scale_ub: Optional[float], 2025-05-07T20:31:57.7079208Z contiguous: bool, 2025-05-07T20:31:57.7079298Z compiled: bool, 2025-05-07T20:31:57.7079376Z ) -> None: 2025-05-07T20:31:57.7079471Z torch.manual_seed(2025) 2025-05-07T20:31:57.7079550Z 2025-05-07T20:31:57.7079796Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7081656Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7081662Z 2025-05-07T20:31:57.7081779Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7081788Z 2025-05-07T20:31:57.7081890Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7082122Z self=, 2025-05-07T20:31:57.7082205Z T=16384, 2025-05-07T20:31:57.7082287Z D=7168, 2025-05-07T20:31:57.7082372Z scale_ub=1200.0, 2025-05-07T20:31:57.7082472Z contiguous=True, 2025-05-07T20:31:57.7082573Z compiled=False, 2025-05-07T20:31:57.7082662Z ) 2025-05-07T20:31:57.7082892Z self = 2025-05-07T20:31:57.7083079Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7083083Z 2025-05-07T20:31:57.7083160Z @given( 2025-05-07T20:31:57.7083279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7083384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7083497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7083698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7083813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7083887Z ) 2025-05-07T20:31:57.7084151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7084244Z def test_silu_mul_quant( 2025-05-07T20:31:57.7084322Z self, 2025-05-07T20:31:57.7084406Z T: int, 2025-05-07T20:31:57.7084481Z D: int, 2025-05-07T20:31:57.7084578Z scale_ub: Optional[float], 2025-05-07T20:31:57.7084672Z contiguous: bool, 2025-05-07T20:31:57.7084759Z compiled: bool, 2025-05-07T20:31:57.7084837Z ) -> None: 2025-05-07T20:31:57.7084939Z torch.manual_seed(2025) 2025-05-07T20:31:57.7085011Z 2025-05-07T20:31:57.7085188Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7087048Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
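[Note] At this point several examples in a row have failed at their very first allocation while ~21.7 GiB stays allocated by PyTorch, which suggests tensors from earlier examples (or an earlier test) are still live. One plausible mitigation, not taken from this log, is to drop references and return cached blocks between Hypothesis examples, e.g. with a hypothetical helper like:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        # Hypothetical cleanup hook (not in activation_test.py): collect dead
        # Python references, then return cached blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()
        # Visibility into what is held by live tensors vs. the allocator cache:
        print(f"allocated={torch.cuda.memory_allocated() / 2**30:.2f} GiB "
              f"reserved={torch.cuda.memory_reserved() / 2**30:.2f} GiB")

Note that torch.cuda.empty_cache() only releases blocks the caching allocator holds but no tensor uses; the 21.73 GiB reported as "allocated by PyTorch" can only shrink once the tensors referencing it are released.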
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7087059Z 2025-05-07T20:31:57.7087182Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7087187Z 2025-05-07T20:31:57.7087289Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7087516Z self=, 2025-05-07T20:31:57.7087600Z T=128, 2025-05-07T20:31:57.7087677Z D=5120, 2025-05-07T20:31:57.7087761Z scale_ub=1200.0, 2025-05-07T20:31:57.7087853Z contiguous=False, 2025-05-07T20:31:57.7087938Z compiled=False, 2025-05-07T20:31:57.7088015Z ) 2025-05-07T20:31:57.7088243Z self = 2025-05-07T20:31:57.7088418Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.7088422Z 2025-05-07T20:31:57.7088606Z @given( 2025-05-07T20:31:57.7088727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7088824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7088946Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7089063Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7089179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7089258Z ) 2025-05-07T20:31:57.7089508Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7089620Z def test_silu_mul_quant( 2025-05-07T20:31:57.7089695Z self, 2025-05-07T20:31:57.7089772Z T: int, 2025-05-07T20:31:57.7089855Z D: int, 2025-05-07T20:31:57.7089957Z scale_ub: Optional[float], 2025-05-07T20:31:57.7090051Z contiguous: bool, 2025-05-07T20:31:57.7090136Z compiled: bool, 2025-05-07T20:31:57.7090212Z ) -> None: 2025-05-07T20:31:57.7090319Z torch.manual_seed(2025) 2025-05-07T20:31:57.7090390Z 2025-05-07T20:31:57.7090559Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7090640Z 2025-05-07T20:31:57.7090733Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7090858Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7090959Z x = x_sign * x_clamp 2025-05-07T20:31:57.7091040Z x0 = x[:, :D] 2025-05-07T20:31:57.7091120Z x1 = x[:, D:] 2025-05-07T20:31:57.7091200Z 2025-05-07T20:31:57.7091284Z if contiguous: 2025-05-07T20:31:57.7091383Z x0 = x0.contiguous() 2025-05-07T20:31:57.7091473Z x1 = x1.contiguous() 2025-05-07T20:31:57.7091545Z 2025-05-07T20:31:57.7091723Z if scale_ub is not None: 2025-05-07T20:31:57.7091893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.7092031Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.7092114Z ) 2025-05-07T20:31:57.7092196Z else: 2025-05-07T20:31:57.7092290Z scale_ub_tensor = None 2025-05-07T20:31:57.7092371Z 2025-05-07T20:31:57.7092502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.7092593Z op = silu_mul_quant 2025-05-07T20:31:57.7092684Z if compiled: 2025-05-07T20:31:57.7092782Z op = torch.compile(op) 2025-05-07T20:31:57.7092894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7092967Z 2025-05-07T20:31:57.7093057Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.7093061Z 2025-05-07T20:31:57.7093166Z moe/activation_test.py:117: 2025-05-07T20:31:57.7093298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7093404Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.7093512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7094038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.7094136Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.7094513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:57.7094741Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.7095098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.7095192Z kernel = self.compile(
2025-05-07T20:31:57.7095591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.7095776Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.7095911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.7095915Z
2025-05-07T20:31:57.7096217Z self =
2025-05-07T20:31:57.7097031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.7097550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474ab2fc0>}
2025-05-07T20:31:57.7098336Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.7098533Z context =
2025-05-07T20:31:57.7098538Z
2025-05-07T20:31:57.7098712Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.7098986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.7099093Z module_map=module_map)
2025-05-07T20:31:57.7099263Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.7099362Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.7099446Z E ^
2025-05-07T20:31:57.7099814Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7099819Z 2025-05-07T20:31:57.7100243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7100248Z 2025-05-07T20:31:57.7100356Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7100663Z self=, 2025-05-07T20:31:57.7100745Z T=2048, 2025-05-07T20:31:57.7100821Z D=7168, 2025-05-07T20:31:57.7100904Z scale_ub=None, 2025-05-07T20:31:57.7100998Z contiguous=False, 2025-05-07T20:31:57.7101084Z compiled=False, 2025-05-07T20:31:57.7101158Z ) 2025-05-07T20:31:57.7101401Z self = 2025-05-07T20:31:57.7101609Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.7101615Z 2025-05-07T20:31:57.7101695Z @given( 2025-05-07T20:31:57.7101820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7101919Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7102041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7102161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7102274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7102358Z ) 2025-05-07T20:31:57.7102612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7102705Z def test_silu_mul_quant( 2025-05-07T20:31:57.7102791Z self, 2025-05-07T20:31:57.7102868Z T: int, 2025-05-07T20:31:57.7102944Z D: int, 2025-05-07T20:31:57.7103054Z scale_ub: Optional[float], 2025-05-07T20:31:57.7103142Z contiguous: bool, 2025-05-07T20:31:57.7103227Z compiled: bool, 2025-05-07T20:31:57.7103311Z ) -> None: 2025-05-07T20:31:57.7103406Z torch.manual_seed(2025) 2025-05-07T20:31:57.7103478Z 2025-05-07T20:31:57.7103655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7105591Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
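[Note] The triton.compiler.errors.CompilationError shown above is an architecture limit rather than a kernel bug: Triton's fp8e4nv type (FP8 E4M3) is only code-generated for NVIDIA compute capability 8.9 and newer, while the A10G GPUs on this linux.g5.4xlarge runner are compute capability 8.6, where Triton offers only the fp8e5 and fp8e4b15 variants named in the ValueError. A minimal sketch of a capability guard, assuming one wanted to skip rather than fail these cases (this guard is not in the test file):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv needs sm_89+ (Ada/Hopper); the A10G here is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires sm_89 or newer")
    class Fp8KernelTests(unittest.TestCase):
        ...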
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7105609Z 2025-05-07T20:31:57.7105732Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7105737Z 2025-05-07T20:31:57.7105839Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7106072Z self=, 2025-05-07T20:31:57.7106442Z T=128, 2025-05-07T20:31:57.7106558Z D=7168, 2025-05-07T20:31:57.7106693Z scale_ub=1200.0, 2025-05-07T20:31:57.7106794Z contiguous=True, 2025-05-07T20:31:57.7106877Z compiled=True, 2025-05-07T20:31:57.7106956Z ) 2025-05-07T20:31:57.7107181Z self = 2025-05-07T20:31:57.7107366Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.7107371Z 2025-05-07T20:31:57.7107448Z @given( 2025-05-07T20:31:57.7107569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7107680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7107796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7107916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7108037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7108111Z ) 2025-05-07T20:31:57.7108364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7108464Z def test_silu_mul_quant( 2025-05-07T20:31:57.7108540Z self, 2025-05-07T20:31:57.7108623Z T: int, 2025-05-07T20:31:57.7108700Z D: int, 2025-05-07T20:31:57.7108798Z scale_ub: Optional[float], 2025-05-07T20:31:57.7108895Z contiguous: bool, 2025-05-07T20:31:57.7109224Z compiled: bool, 2025-05-07T20:31:57.7109300Z ) -> None: 2025-05-07T20:31:57.7109399Z torch.manual_seed(2025) 2025-05-07T20:31:57.7109470Z 2025-05-07T20:31:57.7109643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7109727Z 2025-05-07T20:31:57.7109823Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7109949Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7110046Z x = x_sign * x_clamp 2025-05-07T20:31:57.7110125Z x0 = x[:, :D] 2025-05-07T20:31:57.7110207Z x1 = x[:, D:] 2025-05-07T20:31:57.7110285Z 2025-05-07T20:31:57.7110369Z if contiguous: 2025-05-07T20:31:57.7110468Z x0 = x0.contiguous() 2025-05-07T20:31:57.7110560Z x1 = x1.contiguous() 2025-05-07T20:31:57.7110632Z 2025-05-07T20:31:57.7110725Z if scale_ub is not None: 2025-05-07T20:31:57.7110830Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.7110975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.7111058Z ) 2025-05-07T20:31:57.7111133Z else: 2025-05-07T20:31:57.7111227Z scale_ub_tensor = None 2025-05-07T20:31:57.7111305Z 2025-05-07T20:31:57.7111441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.7111532Z op = silu_mul_quant 2025-05-07T20:31:57.7111622Z if compiled: 2025-05-07T20:31:57.7111721Z op = torch.compile(op) 2025-05-07T20:31:57.7111835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7111908Z 2025-05-07T20:31:57.7112020Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.7112026Z 2025-05-07T20:31:57.7112137Z moe/activation_test.py:117: 2025-05-07T20:31:57.7112286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7112387Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.7112493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7112879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.7112972Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.7113620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.7113721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.7114098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.7114325Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.7114677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.7114776Z kernel = self.compile( 2025-05-07T20:31:57.7115171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.7115360Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.7115491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7115496Z 2025-05-07T20:31:57.7115708Z self = 2025-05-07T20:31:57.7116523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.7117039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44747c5120>} 2025-05-07T20:31:57.7117824Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.7118097Z context = 2025-05-07T20:31:57.7118102Z 2025-05-07T20:31:57.7118275Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.7118555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7118661Z module_map=module_map) 2025-05-07T20:31:57.7118830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7118928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7119005Z E ^ 2025-05-07T20:31:57.7119378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7119383Z 2025-05-07T20:31:57.7119811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7119822Z 2025-05-07T20:31:57.7119928Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7120159Z self=, 2025-05-07T20:31:57.7120236Z T=128, 2025-05-07T20:31:57.7120316Z D=7168, 2025-05-07T20:31:57.7120406Z scale_ub=1200.0, 2025-05-07T20:31:57.7120490Z contiguous=True, 2025-05-07T20:31:57.7120582Z compiled=False, 2025-05-07T20:31:57.7120654Z ) 2025-05-07T20:31:57.7120876Z self = 2025-05-07T20:31:57.7121056Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7121060Z 2025-05-07T20:31:57.7121136Z @given( 2025-05-07T20:31:57.7121263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7121364Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7121479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7121601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7121724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7121803Z ) 2025-05-07T20:31:57.7122060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7122272Z def test_silu_mul_quant( 2025-05-07T20:31:57.7122364Z self, 2025-05-07T20:31:57.7122461Z T: int, 2025-05-07T20:31:57.7122550Z D: int, 2025-05-07T20:31:57.7122655Z scale_ub: Optional[float], 2025-05-07T20:31:57.7122746Z contiguous: bool, 2025-05-07T20:31:57.7122832Z compiled: bool, 2025-05-07T20:31:57.7122918Z ) -> None: 2025-05-07T20:31:57.7123013Z torch.manual_seed(2025) 2025-05-07T20:31:57.7123086Z 2025-05-07T20:31:57.7123262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7123337Z 2025-05-07T20:31:57.7123429Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7123560Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7125423Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7125429Z 2025-05-07T20:31:57.7125555Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.7125560Z 2025-05-07T20:31:57.7125663Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7125897Z self=, 2025-05-07T20:31:57.7125977Z T=128, 2025-05-07T20:31:57.7126054Z D=5120, 2025-05-07T20:31:57.7126143Z scale_ub=1200.0, 2025-05-07T20:31:57.7126312Z contiguous=True, 2025-05-07T20:31:57.7126397Z compiled=True, 2025-05-07T20:31:57.7126477Z ) 2025-05-07T20:31:57.7126701Z self = 2025-05-07T20:31:57.7126878Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.7126883Z 2025-05-07T20:31:57.7126968Z @given( 2025-05-07T20:31:57.7127088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7127187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7127306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7127423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7127542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7127615Z ) 2025-05-07T20:31:57.7127866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7127964Z def test_silu_mul_quant( 2025-05-07T20:31:57.7128046Z self, 2025-05-07T20:31:57.7128122Z T: int, 2025-05-07T20:31:57.7128202Z D: int, 2025-05-07T20:31:57.7128298Z scale_ub: Optional[float], 2025-05-07T20:31:57.7128388Z contiguous: bool, 2025-05-07T20:31:57.7128482Z compiled: bool, 2025-05-07T20:31:57.7128561Z ) -> None: 2025-05-07T20:31:57.7128655Z torch.manual_seed(2025) 2025-05-07T20:31:57.7128733Z 2025-05-07T20:31:57.7128899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7128978Z 2025-05-07T20:31:57.7129070Z > x_sign = torch.sign(x) 2025-05-07T20:31:57.7130911Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7130929Z 2025-05-07T20:31:57.7131126Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:57.7131131Z 2025-05-07T20:31:57.7131234Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7131466Z self=, 2025-05-07T20:31:57.7131545Z T=128, 2025-05-07T20:31:57.7131620Z D=7168, 2025-05-07T20:31:57.7131708Z scale_ub=None, 2025-05-07T20:31:57.7131861Z contiguous=True, 2025-05-07T20:31:57.7131943Z compiled=True, 2025-05-07T20:31:57.7132021Z ) 2025-05-07T20:31:57.7132245Z self = 2025-05-07T20:31:57.7132421Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:57.7132425Z 2025-05-07T20:31:57.7132512Z @given( 2025-05-07T20:31:57.7132630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7132733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7132855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7132973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7133091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7133168Z ) 2025-05-07T20:31:57.7133419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7133520Z def test_silu_mul_quant( 2025-05-07T20:31:57.7133597Z self, 2025-05-07T20:31:57.7133679Z T: int, 2025-05-07T20:31:57.7133759Z D: int, 2025-05-07T20:31:57.7133856Z scale_ub: Optional[float], 2025-05-07T20:31:57.7133951Z contiguous: bool, 2025-05-07T20:31:57.7134037Z compiled: bool, 2025-05-07T20:31:57.7134115Z ) -> None: 2025-05-07T20:31:57.7134216Z torch.manual_seed(2025) 2025-05-07T20:31:57.7134369Z 2025-05-07T20:31:57.7134539Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7136390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7136396Z 2025-05-07T20:31:57.7136515Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7136657Z =============================== warnings summary =============================== 2025-05-07T20:31:57.7136976Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:57.7137302Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:57.7137619Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:57.7138530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:57.7138775Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:57.7138779Z 2025-05-07T20:31:57.7138966Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:57.7140377Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:57.7140580Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:57.7140585Z 2025-05-07T20:31:57.7140808Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:57.7140979Z ================== 1 failed, 1 passed, 13 warnings in 20.16s =================== 2025-05-07T20:31:59.4284684Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:59.4905971Z 2025-05-07T20:31:59.4906689Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:59.4907187Z 2025-05-07T20:31:59.4907220Z 2025-05-07T20:31:59.4928872Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:01.6431735Z ============================= test session starts ============================== 2025-05-07T20:32:01.6432555Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:01.6433150Z cachedir: .pytest_cache 2025-05-07T20:32:01.6433753Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:01.6434494Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:01.6434914Z plugins: hypothesis-6.131.14 2025-05-07T20:32:03.2566738Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:03.3640196Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:03.3641079Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:03.3641300Z 2025-05-07T20:32:05.4355249Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4356397Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:05.4357789Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4359284Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4368308Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4369695Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4371143Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4372234Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4373505Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4375342Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4376457Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4377786Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4379085Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:05.4380357Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4381619Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:05.4382479Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4383543Z W0507 
20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:05.4384609Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:05.4385423Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:05.4386841Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4388170Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4389335Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.4390419Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:05.4391632Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4393054Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4394157Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4395098Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4395864Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:05.4396909Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4514402Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4516216Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:05.4517610Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4519086Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4520093Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4521449Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4522878Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4523892Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4525209Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4526631Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4527868Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4529190Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4530475Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:05.4531730Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4533097Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:05.4533954Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4535014Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:05.4536075Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:05.4536897Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:05.4538150Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4539585Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4540746Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.4541969Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:05.4543198Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4544654Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4545768Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4546712Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4547475Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:05.4548534Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8625723Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8626820Z self=, 2025-05-07T20:32:05.8627240Z T=1, 2025-05-07T20:32:05.8627426Z D=5120, 2025-05-07T20:32:05.8627613Z scale_ub=None, 2025-05-07T20:32:05.8627836Z contiguous=True, 2025-05-07T20:32:05.8628056Z compiled=True, 2025-05-07T20:32:05.8628270Z ) 2025-05-07T20:32:05.8628598Z self = 2025-05-07T20:32:05.8629098Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.8629370Z 2025-05-07T20:32:05.8629452Z @given( 2025-05-07T20:32:05.8629692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8630009Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8630328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8630664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8631000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8631292Z ) 2025-05-07T20:32:05.8631653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8632108Z def test_silu_mul_quant( 2025-05-07T20:32:05.8632350Z self, 2025-05-07T20:32:05.8632557Z T: int, 2025-05-07T20:32:05.8632765Z D: int, 2025-05-07T20:32:05.8632984Z scale_ub: Optional[float], 2025-05-07T20:32:05.8633262Z contiguous: bool, 2025-05-07T20:32:05.8633511Z compiled: bool, 2025-05-07T20:32:05.8633740Z ) -> None: 2025-05-07T20:32:05.8633962Z torch.manual_seed(2025) 2025-05-07T20:32:05.8634216Z 2025-05-07T20:32:05.8634531Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8634884Z 2025-05-07T20:32:05.8635085Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8635385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8635697Z x = x_sign * x_clamp 2025-05-07T20:32:05.8635943Z x0 = x[:, :D] 2025-05-07T20:32:05.8636173Z x1 = x[:, D:] 2025-05-07T20:32:05.8636380Z 2025-05-07T20:32:05.8636571Z if contiguous: 2025-05-07T20:32:05.8636810Z x0 = x0.contiguous() 2025-05-07T20:32:05.8637226Z x1 = x1.contiguous() 2025-05-07T20:32:05.8637472Z 2025-05-07T20:32:05.8637669Z if scale_ub is not None: 2025-05-07T20:32:05.8637942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8638288Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8638606Z ) 2025-05-07T20:32:05.8638798Z else: 2025-05-07T20:32:05.8639013Z scale_ub_tensor = None 2025-05-07T20:32:05.8639269Z 2025-05-07T20:32:05.8639499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8639821Z op = silu_mul_quant 2025-05-07T20:32:05.8640079Z if compiled: 2025-05-07T20:32:05.8640330Z op = torch.compile(op) 2025-05-07T20:32:05.8640625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8640912Z 2025-05-07T20:32:05.8641108Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.8641395Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.8641692Z 2025-05-07T20:32:05.8641939Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8642275Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.8642572Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.8642894Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.8643254Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.8643573Z 2025-05-07T20:32:05.8643779Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.8643977Z 2025-05-07T20:32:05.8644088Z moe/activation_test.py:126: 2025-05-07T20:32:05.8644439Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8644782Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.8645213Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.8646035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.8646817Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.8647387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8648093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8648798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.8649544Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.8650299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.8650961Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.8651581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.8652186Z fn() 2025-05-07T20:32:05.8652715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.8653309Z self.fn.run( 2025-05-07T20:32:05.8653793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8654339Z kernel = self.compile( 2025-05-07T20:32:05.8654888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8655562Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8655971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8656207Z 2025-05-07T20:32:05.8656429Z self = 2025-05-07T20:32:05.8657635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8659199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4960c20>} 2025-05-07T20:32:05.8660592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8661656Z context = 2025-05-07T20:32:05.8661957Z 2025-05-07T20:32:05.8662128Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8662676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8663158Z module_map=module_map) 2025-05-07T20:32:05.8663539Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8663907Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.8664176Z E ^ 2025-05-07T20:32:05.8664656Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.8665128Z 
2025-05-07T20:32:05.8665555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.8666085Z 
2025-05-07T20:32:05.8666196Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.8666616Z     self=,
2025-05-07T20:32:05.8667030Z     T=2048,
2025-05-07T20:32:05.8667224Z     D=5120,
2025-05-07T20:32:05.8667509Z     scale_ub=1200.0,
2025-05-07T20:32:05.8667738Z     contiguous=True,
2025-05-07T20:32:05.8667969Z     compiled=False,
2025-05-07T20:32:05.8668173Z )
2025-05-07T20:32:06.3050387Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:06.3051547Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last):
2025-05-07T20:32:06.3053022Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:06.3054612Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:06.3055641Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3057020Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:06.3058481Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.3059514Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3060807Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:06.3062633Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.3063765Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3065121Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:06.3066445Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     generator.visit(fn.parse())
2025-05-07T20:32:06.3067751Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:06.3069027Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ret = super().visit(node)
2025-05-07T20:32:06.3069901Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3071006Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:06.3072085Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     return visitor(node)
2025-05-07T20:32:06.3072921Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]            ^^^^^^^^^^^^^
2025-05-07T20:32:06.3074369Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:06.3075730Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:06.3076916Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:06.3078018Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     self.visit(item)
2025-05-07T20:32:06.3079266Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:06.3080716Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:06.3081841Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.3082803Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.3083577Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^
2025-05-07T20:32:06.3084660Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.8475628Z self = 
2025-05-07T20:32:06.8476563Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:06.8476858Z 
2025-05-07T20:32:06.8476938Z     @given(
2025-05-07T20:32:06.8477173Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:06.8477502Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:06.8477804Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:06.8478139Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:06.8478473Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:06.8478757Z     )
2025-05-07T20:32:06.8479113Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:06.8479571Z     def test_silu_mul_quant(
2025-05-07T20:32:06.8479812Z         self,
2025-05-07T20:32:06.8480012Z         T: int,
2025-05-07T20:32:06.8480214Z         D: int,
2025-05-07T20:32:06.8480431Z         scale_ub: Optional[float],
2025-05-07T20:32:06.8480710Z         contiguous: bool,
2025-05-07T20:32:06.8480961Z         compiled: bool,
2025-05-07T20:32:06.8481194Z     ) -> None:
2025-05-07T20:32:06.8481408Z         torch.manual_seed(2025)
2025-05-07T20:32:06.8481654Z 
2025-05-07T20:32:06.8481937Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:06.8482281Z 
2025-05-07T20:32:06.8482479Z         x_sign = torch.sign(x)
2025-05-07T20:32:06.8482775Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:06.8483087Z         x = x_sign * x_clamp
2025-05-07T20:32:06.8483335Z         x0 = x[:, :D]
2025-05-07T20:32:06.8483557Z         x1 = x[:, D:]
2025-05-07T20:32:06.8483763Z 
2025-05-07T20:32:06.8483956Z         if contiguous:
2025-05-07T20:32:06.8484197Z             x0 = x0.contiguous()
2025-05-07T20:32:06.8484455Z             x1 = x1.contiguous()
2025-05-07T20:32:06.8484706Z 
2025-05-07T20:32:06.8484905Z         if scale_ub is not None:
2025-05-07T20:32:06.8485178Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:06.8485526Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:06.8485842Z             )
2025-05-07T20:32:06.8486213Z         else:
2025-05-07T20:32:06.8486423Z             scale_ub_tensor = None
2025-05-07T20:32:06.8486684Z 
2025-05-07T20:32:06.8487081Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.8487402Z             op = silu_mul_quant
2025-05-07T20:32:06.8487658Z             if compiled:
2025-05-07T20:32:06.8487914Z                 op = torch.compile(op)
2025-05-07T20:32:06.8488211Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.8488491Z 
2025-05-07T20:32:06.8488689Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:06.8488858Z 
2025-05-07T20:32:06.8488959Z moe/activation_test.py:117: 
2025-05-07T20:32:06.8489263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8489603Z moe/activation_test.py:115: in fn
2025-05-07T20:32:06.8489885Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.8490611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:06.8491328Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:06.8491944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.8492645Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.8493333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.8493885Z     kernel = self.compile(
2025-05-07T20:32:06.8494446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.8495116Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.8495526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8495849Z 
2025-05-07T20:32:06.8496069Z self = 
2025-05-07T20:32:06.8497198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.8498640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4820180>}
2025-05-07T20:32:06.8500038Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:06.8501102Z context = 
2025-05-07T20:32:06.8501398Z 
2025-05-07T20:32:06.8501578Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.8502112Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.8502599Z                            module_map=module_map)
2025-05-07T20:32:06.8502975Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.8503340Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.8503602Z E       ^
2025-05-07T20:32:06.8504085Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.8504602Z 
2025-05-07T20:32:06.8505037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:06.8505568Z 
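Every example fails for the same reason: this job runs on a g5.4xlarge, whose NVIDIA A10G GPU is compute capability 8.6, and Triton only lowers the fp8e4nv type (torch.float8_e4m3fn) on SM 8.9+ parts such as L4 or H100; on SM 8.6 it offers only fp8e5 and fp8e4b15, which is exactly what the ValueError lists. A minimal skip-guard sketch, assuming a unittest-style test class like the one above (the helper and decorator names here are illustrative, not FBGEMM APIs):

    import unittest

    import torch


    def _cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) only compiles in Triton on SM 8.9+
        # (e.g. L4, L40S, H100); the A10G in this job is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical decorator; applied to test_silu_mul_quant this would skip
    # cleanly instead of erroring on every Hypothesis example.
    skip_unless_fp8e4nv = unittest.skipUnless(
        _cuda_supports_fp8e4nv(), "requires native fp8e4nv support (SM 8.9+)"
    )
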
2025-05-07T20:32:06.8505674Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:06.8506100Z     self=,
2025-05-07T20:32:06.8506677Z     T=2048,
2025-05-07T20:32:06.8506872Z     D=5120,
2025-05-07T20:32:06.8507067Z     scale_ub=1200.0,
2025-05-07T20:32:06.8507295Z     contiguous=True,
2025-05-07T20:32:06.8507522Z     compiled=True,
2025-05-07T20:32:06.8507730Z )
2025-05-07T20:32:06.8508190Z self = 
2025-05-07T20:32:06.8508710Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:06.8508991Z 
2025-05-07T20:32:06.8519035Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.8519349Z             op = silu_mul_quant
2025-05-07T20:32:06.8519607Z             if compiled:
2025-05-07T20:32:06.8519860Z                 op = torch.compile(op)
2025-05-07T20:32:06.8520155Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.8520437Z 
2025-05-07T20:32:06.8520635Z         y_fp8, y_scale = fn()
2025-05-07T20:32:06.8520935Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:06.8521229Z 
2025-05-07T20:32:06.8521473Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.8521820Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:06.8522115Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:06.8522437Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:06.8522803Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.8523116Z 
2025-05-07T20:32:06.8523329Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:06.8523527Z 
2025-05-07T20:32:06.8523635Z moe/activation_test.py:126: 
2025-05-07T20:32:06.8523935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8524277Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:06.8524626Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.8525503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:06.8533999Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:06.8534763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.8535479Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.8536190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:06.8536938Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:06.8537696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:06.8538360Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:06.8538977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:06.8539519Z     fn()
2025-05-07T20:32:06.8540053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:06.8540661Z     self.fn.run(
2025-05-07T20:32:06.8541139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.8541692Z     kernel = self.compile(
2025-05-07T20:32:06.8542260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.8542935Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.8543350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8543588Z 
2025-05-07T20:32:06.8543810Z self = 
2025-05-07T20:32:06.8545029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.8546447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d45eaa20>}
2025-05-07T20:32:06.8547849Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:06.8548921Z context = 
2025-05-07T20:32:06.8549219Z 
2025-05-07T20:32:06.8549398Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.8549933Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.8550417Z                            module_map=module_map)
2025-05-07T20:32:06.8550795Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.8551166Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:06.8551432Z E       ^
2025-05-07T20:32:06.8551910Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.8552381Z 
2025-05-07T20:32:06.8552809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:06.8553338Z 
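Note how compiled=True only shifts the failure site: the torch.compile path surfaces the problem as identify_mutated_tensors warnings, and the first hard error comes from the eager reference path, where triton_quantize_fp8_row compiles _kernel_quantize_fp8_row and hits the same ValueError. For context, a plain-PyTorch sketch of the row-wise fp8 quantization the reference path performs; this is an assumption-based stand-in for triton_quantize_fp8_row, whose exact scale_ub/epsilon handling may differ:

    from typing import Optional, Tuple

    import torch


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1 in fp32, then row-wise quantization to fp8e4m3:
        # each row is scaled so its max magnitude maps onto the fp8 range.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap per-row range
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = (row_max / fp8_max).clamp(min=1e-12)  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The test's accuracy check then compares y_fp8.to(torch.float32) * y_scale[:, None] against the unquantized fp32 result, which is why both the kernel under test and the reference need a working fp8e4nv target.
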
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.0969320Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.0970875Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.0972113Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.0973448Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.0974748Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:07.0976014Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.0977282Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:07.0978133Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.0979192Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:07.0980249Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:07.0981070Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:07.0982323Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.0983734Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.0984894Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.0985977Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:07.0987203Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.0988612Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.0989718Z 
W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.0990662Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.0991429Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:07.0992479Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.1578194Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.1579467Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:07.1580852Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.1582319Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.1583328Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1584683Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.1586124Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1587140Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1588411Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.1589849Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1591377Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1592709Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.1594000Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:07.1595309Z W0507 20:32:07.154000 97296 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.1596566Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:07.1597416Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1598468Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:07.1599521Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:07.1600337Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:07.1601579Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.1603028Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.1604182Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.1605310Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:07.1606693Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.1608096Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.1609207Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1610146Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1610911Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:07.1612013Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.6565750Z self = 2025-05-07T20:32:07.6566341Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:07.6566646Z 2025-05-07T20:32:07.6566726Z @given( 2025-05-07T20:32:07.6566964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.6567279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.6567748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.6568089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.6568420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.6568708Z ) 2025-05-07T20:32:07.6569059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.6569515Z def test_silu_mul_quant( 2025-05-07T20:32:07.6569764Z self, 2025-05-07T20:32:07.6569957Z T: int, 2025-05-07T20:32:07.6570158Z D: int, 2025-05-07T20:32:07.6570380Z scale_ub: Optional[float], 2025-05-07T20:32:07.6570649Z contiguous: bool, 2025-05-07T20:32:07.6570890Z compiled: bool, 2025-05-07T20:32:07.6571119Z ) -> None: 2025-05-07T20:32:07.6571342Z torch.manual_seed(2025) 2025-05-07T20:32:07.6571586Z 2025-05-07T20:32:07.6571923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.6572268Z 2025-05-07T20:32:07.6572472Z x_sign = torch.sign(x) 2025-05-07T20:32:07.6572770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.6573080Z x = x_sign * x_clamp 2025-05-07T20:32:07.6573326Z x0 = x[:, :D] 2025-05-07T20:32:07.6573547Z x1 = x[:, D:] 2025-05-07T20:32:07.6573758Z 2025-05-07T20:32:07.6573942Z if contiguous: 2025-05-07T20:32:07.6574182Z x0 = x0.contiguous() 2025-05-07T20:32:07.6574446Z x1 = x1.contiguous() 2025-05-07T20:32:07.6574684Z 2025-05-07T20:32:07.6574877Z if scale_ub is not None: 2025-05-07T20:32:07.6575154Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.6575489Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.6575935Z ) 2025-05-07T20:32:07.6576133Z else: 2025-05-07T20:32:07.6576343Z scale_ub_tensor = None 2025-05-07T20:32:07.6576599Z 2025-05-07T20:32:07.6576835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.6577155Z op = silu_mul_quant 2025-05-07T20:32:07.6577413Z if compiled: 2025-05-07T20:32:07.6577666Z op = torch.compile(op) 2025-05-07T20:32:07.6577961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.6578243Z 2025-05-07T20:32:07.6578447Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.6578614Z 2025-05-07T20:32:07.6578725Z moe/activation_test.py:117: 2025-05-07T20:32:07.6579025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6579368Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.6579657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.6580364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.6581087Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.6581646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.6582353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.6583036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.6583590Z kernel = self.compile( 2025-05-07T20:32:07.6584151Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.6584830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.6585246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6585486Z 2025-05-07T20:32:07.6585696Z self = 2025-05-07T20:32:07.6586910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.6588331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cf06b6a0>} 2025-05-07T20:32:07.6589721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.6590783Z context = 2025-05-07T20:32:07.6591084Z 2025-05-07T20:32:07.6591260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.6591812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.6592291Z module_map=module_map) 2025-05-07T20:32:07.6592675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.6593040Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.6593302Z E ^ 2025-05-07T20:32:07.6593783Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.6594250Z 2025-05-07T20:32:07.6594688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.6595223Z 2025-05-07T20:32:07.6595334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.6595755Z self=, 2025-05-07T20:32:07.6596172Z T=1, 2025-05-07T20:32:07.6596361Z D=7168, 2025-05-07T20:32:07.6596557Z scale_ub=None, 2025-05-07T20:32:07.6596864Z contiguous=True, 2025-05-07T20:32:07.6597093Z compiled=True, 2025-05-07T20:32:07.6597301Z ) 2025-05-07T20:32:07.6597632Z self = 2025-05-07T20:32:07.6598143Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.6598410Z 2025-05-07T20:32:07.6598490Z @given( 2025-05-07T20:32:07.6598734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.6599061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.6599370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.6599706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.6600045Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.6600337Z ) 2025-05-07T20:32:07.6600692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.6601153Z def test_silu_mul_quant( 2025-05-07T20:32:07.6601411Z self, 2025-05-07T20:32:07.6601607Z T: int, 2025-05-07T20:32:07.6601806Z D: int, 2025-05-07T20:32:07.6602034Z scale_ub: Optional[float], 2025-05-07T20:32:07.6602309Z contiguous: bool, 2025-05-07T20:32:07.6602560Z compiled: bool, 2025-05-07T20:32:07.6602788Z ) -> None: 2025-05-07T20:32:07.6603004Z torch.manual_seed(2025) 2025-05-07T20:32:07.6603249Z 2025-05-07T20:32:07.6603528Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.6603872Z 2025-05-07T20:32:07.6604068Z x_sign = torch.sign(x) 2025-05-07T20:32:07.6604365Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.6604681Z x = x_sign * x_clamp 2025-05-07T20:32:07.6604920Z x0 = x[:, :D] 2025-05-07T20:32:07.6605139Z x1 = 
x[:, D:] 2025-05-07T20:32:07.6605351Z 2025-05-07T20:32:07.6605534Z if contiguous: 2025-05-07T20:32:07.6605768Z x0 = x0.contiguous() 2025-05-07T20:32:07.6606036Z x1 = x1.contiguous() 2025-05-07T20:32:07.6606430Z 2025-05-07T20:32:07.6606625Z if scale_ub is not None: 2025-05-07T20:32:07.6606903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.6607362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.6607680Z ) 2025-05-07T20:32:07.6607882Z else: 2025-05-07T20:32:07.6608090Z scale_ub_tensor = None 2025-05-07T20:32:07.6608343Z 2025-05-07T20:32:07.6608580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.6608899Z op = silu_mul_quant 2025-05-07T20:32:07.6609155Z if compiled: 2025-05-07T20:32:07.6609406Z op = torch.compile(op) 2025-05-07T20:32:07.6609708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.6609981Z 2025-05-07T20:32:07.6610177Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.6610470Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.6610769Z 2025-05-07T20:32:07.6611011Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.6611354Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.6611652Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.6612011Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.6612376Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.6612686Z 2025-05-07T20:32:07.6612893Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.6613089Z 2025-05-07T20:32:07.6613196Z moe/activation_test.py:126: 2025-05-07T20:32:07.6613502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6613840Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.6614178Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.6615020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.6615940Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.6616510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.6617217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.6617935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.6618676Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.6619434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.6620100Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.6620728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.6621269Z fn() 2025-05-07T20:32:07.6621794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.6622400Z self.fn.run( 2025-05-07T20:32:07.6622883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.6623433Z kernel = self.compile( 2025-05-07T20:32:07.6623990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.6624665Z module = src.make_ir(options, 
codegen_fns, module_map, context) 2025-05-07T20:32:07.6625067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6625308Z 2025-05-07T20:32:07.6625519Z self = 2025-05-07T20:32:07.6626641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.6628230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec65620>} 2025-05-07T20:32:07.6629627Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.6630697Z context = 2025-05-07T20:32:07.6631001Z 2025-05-07T20:32:07.6631170Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.6631712Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.6632192Z module_map=module_map) 2025-05-07T20:32:07.6632569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.6632935Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.6633200Z E ^ 2025-05-07T20:32:07.6633685Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.6634160Z 2025-05-07T20:32:07.6634640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.6635174Z 2025-05-07T20:32:07.6635285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.6635706Z self=, 2025-05-07T20:32:07.6636121Z T=4096, 2025-05-07T20:32:07.6636315Z D=5120, 2025-05-07T20:32:07.6636503Z scale_ub=None, 2025-05-07T20:32:07.6636723Z contiguous=False, 2025-05-07T20:32:07.6636954Z compiled=False, 2025-05-07T20:32:07.6637170Z ) 2025-05-07T20:32:08.1114208Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.1115319Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:08.1116703Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.1118169Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.1119166Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1120526Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.1121963Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.1122980Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1124249Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.1125676Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.1126950Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1128286Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.1129590Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:08.1130864Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.1132176Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:08.1133046Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1134112Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:08.1135223Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:08.1136041Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:08.1137303Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.1138779Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.1139943Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.1141028Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:08.1142252Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.1143666Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.1144824Z 
W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.1145778Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.1146548Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:08.1147603Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3224890Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.3227088Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:08.3230127Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.3233053Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.3234798Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3236150Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.3237590Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3238602Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3239867Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.3241287Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3242504Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3243834Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.3245182Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:08.3246456Z W0507 20:32:08.319000 97296 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.3247712Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:08.3248572Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3249642Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:08.3250703Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:08.3251528Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:08.3252819Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.3254151Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.3255390Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.3256468Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:08.3257679Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.3259076Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.3260174Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3261116Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.3261873Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:08.3262915Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9081633Z self = 2025-05-07T20:32:08.9082743Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:08.9083321Z 2025-05-07T20:32:08.9083476Z @given( 2025-05-07T20:32:08.9083925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.9084869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.9085264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.9085625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.9085946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.9086229Z ) 2025-05-07T20:32:08.9086578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.9087025Z def test_silu_mul_quant( 2025-05-07T20:32:08.9087261Z self, 2025-05-07T20:32:08.9087455Z T: int, 2025-05-07T20:32:08.9087652Z D: int, 2025-05-07T20:32:08.9087860Z scale_ub: Optional[float], 2025-05-07T20:32:08.9088128Z contiguous: bool, 2025-05-07T20:32:08.9088370Z compiled: bool, 2025-05-07T20:32:08.9088597Z ) -> None: 2025-05-07T20:32:08.9095097Z torch.manual_seed(2025) 2025-05-07T20:32:08.9095352Z 2025-05-07T20:32:08.9095636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.9095991Z 2025-05-07T20:32:08.9096184Z x_sign = torch.sign(x) 2025-05-07T20:32:08.9096486Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.9096797Z x = x_sign * x_clamp 2025-05-07T20:32:08.9097044Z x0 = x[:, :D] 2025-05-07T20:32:08.9097267Z x1 = x[:, D:] 2025-05-07T20:32:08.9097480Z 2025-05-07T20:32:08.9097666Z if contiguous: 2025-05-07T20:32:08.9097901Z x0 = x0.contiguous() 2025-05-07T20:32:08.9098167Z x1 = x1.contiguous() 2025-05-07T20:32:08.9098407Z 2025-05-07T20:32:08.9098601Z if scale_ub is not None: 2025-05-07T20:32:08.9098878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.9099219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.9099545Z ) 2025-05-07T20:32:08.9099753Z else: 2025-05-07T20:32:08.9099966Z scale_ub_tensor = None 2025-05-07T20:32:08.9100232Z 2025-05-07T20:32:08.9100474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.9100791Z op = silu_mul_quant 2025-05-07T20:32:08.9101049Z if compiled: 2025-05-07T20:32:08.9101457Z op = torch.compile(op) 2025-05-07T20:32:08.9101773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9102048Z 2025-05-07T20:32:08.9102247Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.9102415Z 2025-05-07T20:32:08.9102525Z moe/activation_test.py:117: 2025-05-07T20:32:08.9102834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9103180Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.9103469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9104188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.9104936Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.9105511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.9106535Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.9107304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.9107858Z kernel = self.compile( 2025-05-07T20:32:08.9108417Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.9109092Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.9109499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.9109740Z 
2025-05-07T20:32:08.9109954Z self = 
2025-05-07T20:32:08.9111085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.9112663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec665c0>}
2025-05-07T20:32:08.9114050Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.9115113Z context = 
2025-05-07T20:32:08.9115417Z 
2025-05-07T20:32:08.9115588Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.9116132Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.9116616Z module_map=module_map)
2025-05-07T20:32:08.9116993Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.9117355Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.9117618Z E ^
2025-05-07T20:32:08.9118099Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9118570Z 
2025-05-07T20:32:08.9118999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.9119530Z 
2025-05-07T20:32:08.9119643Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
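Every failure in this job has the same root cause: Triton's NVIDIA backend only exposes the fp8e4nv (float8 e4m3) type on GPUs of compute capability 8.9 or newer, while this runner's g5.4xlarge carries an A10G (SM 8.6), where Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal guard along the following lines (a sketch only; the helper and class names are hypothetical, not part of the FBGEMM test file shown above) would skip these tests on such runners instead of failing them:

    # Sketch: skip FP8 tests where Triton's CUDA backend lacks fp8e4nv.
    import unittest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton enables fp8e4nv only on SM 8.9+ (e.g. L4, L40S, H100);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not fp8e4nv_supported(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantFP8Test(unittest.TestCase):  # hypothetical wrapper
        def test_placeholder(self) -> None:
            pass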
2025-05-07T20:32:08.9148723Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.9149087Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.9149349Z E ^
2025-05-07T20:32:08.9149827Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9150302Z 
2025-05-07T20:32:08.9150730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.9151266Z 
2025-05-07T20:32:08.9151372Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:08.9702829Z self = 
2025-05-07T20:32:08.9703908Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:08.9716340Z y_fp8, y_scale = fn()
2025-05-07T20:32:08.9716628Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:08.9716923Z 
2025-05-07T20:32:08.9717161Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.9717497Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:08.9717787Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:08.9718104Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:08.9718471Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.9718778Z 
2025-05-07T20:32:08.9718983Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:08.9719186Z 
2025-05-07T20:32:08.9719287Z moe/activation_test.py:126:
2025-05-07T20:32:08.9719591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.9719930Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:08.9720411Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.9721226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:08.9721998Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:08.9723966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:08.9724704Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:08.9725455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:08.9726119Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:08.9726743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:08.9727276Z fn()
2025-05-07T20:32:08.9727803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:08.9728405Z self.fn.run(
2025-05-07T20:32:08.9728879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.9729428Z kernel = self.compile(
2025-05-07T20:32:08.9729982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.9730652Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.9731057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.9740845Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.9741205Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:08.9741477Z E ^
2025-05-07T20:32:08.9741961Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9742428Z 
2025-05-07T20:32:08.9742871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
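Note that both code paths compile an FP8 Triton kernel on this architecture: the op under test (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row inside triton_quantize_fp8_row), so even examples that get past fn() fail in ref_fn(). The dtype restriction reproduces without FBGEMM at all; a standalone sketch (assuming a CUDA GPU older than SM 8.9, such as this runner's A10G):

    # Minimal repro sketch: any cast to tl.float8e4nv trips the same
    # ValueError during ast_to_ttir on pre-SM-8.9 GPUs.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError on SM < 8.9.
    _cast_to_fp8e4nv[(1,)](x, y, 128, BLOCK=128)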
2025-05-07T20:32:08.9743514Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:09.1721986Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.1722346Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.1722610Z E ^
2025-05-07T20:32:09.1723089Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.1723991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.1724633Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:09.1760382Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.1760829Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.1761093Z E ^
2025-05-07T20:32:09.1761578Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.1762483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
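Each "Trying example" line is Hypothesis (Verbosity.verbose) reporting the next drawn parameter combination; since the failure is architecture-dependent rather than input-dependent, every combination fails identically. To re-run one case from this log deterministically, it can be pinned with @example. A sketch, assuming the test body shown earlier in the log:

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        pass  # body as in the log above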
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1762051Z 2025-05-07T20:32:09.1762483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1763012Z 2025-05-07T20:32:09.1763122Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1763547Z self=, 2025-05-07T20:32:09.1763960Z T=1, 2025-05-07T20:32:09.1764147Z D=5120, 2025-05-07T20:32:09.1764345Z scale_ub=None, 2025-05-07T20:32:09.1764561Z contiguous=True, 2025-05-07T20:32:09.1764786Z compiled=True, 2025-05-07T20:32:09.1764995Z ) 2025-05-07T20:32:09.4060909Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.4062046Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:09.4063461Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.4064961Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.4065980Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4067346Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.4068980Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4070019Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4071295Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.4072740Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4073854Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4075192Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.4076494Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:09.4077772Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.4079031Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:09.4080011Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4081080Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.4082142Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:09.4082967Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.4084232Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.4085626Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.4086805Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.4087893Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:09.4089123Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.4090537Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.4091651Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4092784Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4093556Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:09.4094616Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.4751453Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.4752553Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:09.4753950Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.4755481Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.4756500Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4757855Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.4759457Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4760481Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4761762Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.4763200Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4764306Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4765705Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.4767003Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:09.4768275Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.4769547Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:09.4770414Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4771589Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.4772712Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:09.4773541Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.4774804Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.4776193Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.4777367Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.4778458Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:09.4779691Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.4781109Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.4782219Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4783279Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4784060Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:09.4785124Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:09.7682244Z self = 
2025-05-07T20:32:09.7682755Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:09.7697363Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:09.7697795Z moe/activation_test.py:126:
2025-05-07T20:32:09.7698436Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:09.7698770Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.7699585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:09.7700366Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:09.7717155Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.7717517Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.7717784Z E ^
2025-05-07T20:32:09.7718263Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.7719158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7719927Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:09.9898020Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.9916136Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:09.9916988Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.9918053Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.9919237Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:09.9920069Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.9921320Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.9922656Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.9923820Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.9924913Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:09.9926196Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.9927611Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.9928724Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9929674Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9930447Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:09.9931908Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.0583590Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.0584681Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:10.0586845Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.0589527Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.0591393Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0593852Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.0596031Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.0597055Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0598493Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.0599927Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.0601022Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0602352Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:10.0603656Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:10.0604929Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.0606319Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:10.0607179Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0608242Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:10.0609297Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:10.0610125Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:10.0611501Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.0612874Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.0614035Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.0615116Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:10.0616353Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.0617763Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.0618871Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.0619816Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.0620583Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:10.0621641Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:10.3519336Z self = 
2025-05-07T20:32:10.3519906Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:10.3534698Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:10.3535010Z moe/activation_test.py:126:
2025-05-07T20:32:10.3535662Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:10.3536008Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:10.3536821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:10.3537734Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:10.3554342Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.3554712Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:10.3554986Z E ^
2025-05-07T20:32:10.3555478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.3556448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.3557102Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.3557533Z self=,
2025-05-07T20:32:10.3557953Z T=128,
2025-05-07T20:32:10.3558154Z D=5120,
2025-05-07T20:32:10.3558348Z scale_ub=None,
2025-05-07T20:32:10.3558575Z contiguous=True,
2025-05-07T20:32:10.3558806Z compiled=True,
2025-05-07T20:32:10.3559011Z )
[elided: two "Encountered an exception in identify_mutated_tensors" warning tracebacks (frame [0/6]) and a test failure dump, identical apart from timestamps to the T=2048 output above, ending in the same CompilationError while compiling _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:10.9994891Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.9995327Z self=,
2025-05-07T20:32:10.9995748Z T=4096,
2025-05-07T20:32:10.9995936Z D=5120,
2025-05-07T20:32:10.9996131Z scale_ub=None,
2025-05-07T20:32:10.9996352Z contiguous=True,
2025-05-07T20:32:10.9996574Z compiled=True,
2025-05-07T20:32:10.9996778Z )
[elided: two identify_mutated_tensors warning tracebacks (frame [0/7]) and another identical failure dump for T=4096, ending in the same CompilationError]
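Every example tried so far dies in the same place: Triton refuses to emit fp8e4nv (torch.float8_e4m3fn) conversions on this runner's A10G, which is compute capability 8.6, while recent Triton CUDA backends require sm_89 or newer (Ada/Hopper) for that dtype. A minimal sketch of a hardware gate follows; supports_fp8_e4nv is a hypothetical helper, not FBGEMM's actual guard:

```python
# Hypothetical guard for FP8 E4M3 test cases; assumes Triton's fp8e4nv
# requirement of NVIDIA compute capability >= 8.9 (Ada/Hopper).
import torch


def supports_fp8_e4nv() -> bool:
    """True when the current CUDA device can compile fp8e4nv Triton kernels."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

A test class could then mark its FP8 cases with `@unittest.skipUnless(supports_fp8_e4nv(), "fp8e4nv requires sm_89+")` instead of letting every Hypothesis example fail at compile time.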
2025-05-07T20:32:11.6420240Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.6420666Z self=,
2025-05-07T20:32:11.6421081Z T=16384,
2025-05-07T20:32:11.6421282Z D=5120,
2025-05-07T20:32:11.6421485Z scale_ub=None,
2025-05-07T20:32:11.6421703Z contiguous=True,
2025-05-07T20:32:11.6421931Z compiled=True,
2025-05-07T20:32:11.6422142Z )
2025-05-07T20:32:11.6673961Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:11.6676139Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:11.6677524Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:11.6678659Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:11.6679806Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[elided: failure dump for T=16384, identical apart from timestamps to the T=2048 dump above: ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError]
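Separately from the FP8 failures, the recompile warning above shows torch.compile abandoning silu_mul_quant after eight recompiles: each Hypothesis example changes T, and slicing x[:, :D] versus calling .contiguous() flips x0's leading stride between 10240 and 5120, so the guards keep missing. A sketch of the usual knobs, assuming a recent PyTorch 2.x where the limit is spelled recompile_limit (older releases call it cache_size_limit):

```python
import torch
import torch._dynamo

# Allow more specializations before torch.compile falls back to eager
# (the default is 8, per the warning above).
torch._dynamo.config.recompile_limit = 64

# Or mark the batch dimension dynamic so size changes reuse one graph;
# x0 here mimics the test's strided slice of a [T, 2*D] tensor.
x0 = torch.randn(16384, 10240)[:, :5120]
torch._dynamo.mark_dynamic(x0, 0)
```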
2025-05-07T20:32:11.7589170Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.7589594Z self=,
2025-05-07T20:32:11.7590061Z T=1,
2025-05-07T20:32:11.7590248Z D=5120,
2025-05-07T20:32:11.7590445Z scale_ub=1200.0,
2025-05-07T20:32:11.7590673Z contiguous=True,
2025-05-07T20:32:11.7590894Z compiled=True,
2025-05-07T20:32:11.7591103Z )
2025-05-07T20:32:11.8948486Z self =
2025-05-07T20:32:11.8949187Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[elided: Hypothesis/pytest listing of the test source, identical to the listing in the T=2048 dump above]
2025-05-07T20:32:11.8961152Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.8961423Z moe/activation_test.py:117:
2025-05-07T20:32:11.8961724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:11.8962066Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.8962358Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.8962928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:11.8963510Z     return fn(*args, **kwargs)
2025-05-07T20:32:11.8964188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.8964899Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:11.8965447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:11.8966223Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:11.8966907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:11.8967449Z     kernel = self.compile(
2025-05-07T20:32:11.8968010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:11.8968731Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:11.8969145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:11.8969591Z self =
2025-05-07T20:32:11.8970710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:11.8972181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96394cccc0>}
2025-05-07T20:32:11.8973570Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:11.8974627Z context =
2025-05-07T20:32:11.8975099Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:11.8975645Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:11.8976171Z                            module_map=module_map)
2025-05-07T20:32:11.8976540Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.8976903Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.8977169Z E       ^
2025-05-07T20:32:11.8977645Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.8978542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
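Note that this example fails inside fn() itself (the compiled silu_mul_quant path, kernel _fbgemm_silu_mul_quant) rather than in ref_fn. For orientation, the reference path is ordinary tensor math followed by rowwise FP8 quantization. A back-of-envelope sketch of what ref_fn's quantization step computes, assuming rowwise scaling into the float8_e4m3fn range (maximum normal value 448.0); the real triton_quantize_fp8_row in fbgemm_gpu may differ in details such as scale clamping and epsilon handling:

```python
from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Rowwise FP8 quantization sketch: y ~ y_fp8.float() * scale[:, None]."""
    row_max = y.abs().amax(dim=1).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / 448.0  # 448.0 = largest normal float8_e4m3fn value
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale
```

Under that convention the test's dequantization, y_fp8.to(torch.float32) * y_scale[:, None], recovers y up to FP8 rounding.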
2025-05-07T20:32:11.8979186Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.8979612Z     self=,
2025-05-07T20:32:11.8980105Z     T=1,
2025-05-07T20:32:11.8980291Z     D=5120,
2025-05-07T20:32:11.8980484Z     scale_ub=None,
2025-05-07T20:32:11.8980701Z     contiguous=False,
2025-05-07T20:32:11.8980924Z     compiled=True,
2025-05-07T20:32:11.8981128Z )
2025-05-07T20:32:11.9592115Z self = 
2025-05-07T20:32:11.9592646Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:11.9607459Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.9607665Z 
2025-05-07T20:32:11.9607767Z moe/activation_test.py:126: 
2025-05-07T20:32:11.9608076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9608414Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:11.9608751Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:11.9609721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:11.9610502Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:11.9611059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:11.9611815Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:11.9612533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:11.9613280Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:11.9614030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:11.9614687Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:11.9615314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:11.9615871Z     fn()
2025-05-07T20:32:11.9616415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:11.9617020Z     self.fn.run(
2025-05-07T20:32:11.9617500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:11.9618115Z     kernel = self.compile(
2025-05-07T20:32:11.9618670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:11.9619349Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:11.9619818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9627186Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9627552Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.9627822Z E       ^
2025-05-07T20:32:11.9628309Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.9628786Z 
2025-05-07T20:32:11.9629221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9629761Z 
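In this example the failure moves into the test's own oracle: ref_fn calls triton_quantize_fp8_row, which compiles a second Triton kernel (_kernel_quantize_fp8_row) needing the same fp8e4nv cast. A pure-eager stand-in is sketched below, inferred only from how the test dequantizes (y ~= y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub clamp is an assumption about the kernel's contract, not taken from fp8_gemm.py:

    import torch

    def quantize_fp8_row_eager(y, scale_ub=None):
        # Per-row symmetric quantization to e4m3; dequant is
        # y ~= y_fp8.float() * scale[:, None], matching the test's usage.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_amax = y.abs().amax(dim=-1).float()
        if scale_ub is not None:  # assumed clamp semantics
            row_amax = torch.minimum(row_amax, scale_ub.float())
        scale = torch.clamp(row_amax, min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale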
2025-05-07T20:32:11.9629864Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9630292Z     self=,
2025-05-07T20:32:11.9630700Z     T=1,
2025-05-07T20:32:11.9630884Z     D=5120,
2025-05-07T20:32:11.9631078Z     scale_ub=None,
2025-05-07T20:32:11.9631295Z     contiguous=True,
2025-05-07T20:32:11.9631601Z     compiled=False,
2025-05-07T20:32:11.9631814Z )
2025-05-07T20:32:12.1118457Z self = 
2025-05-07T20:32:12.1118997Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:12.1130909Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:12.1131077Z 
2025-05-07T20:32:12.1131179Z moe/activation_test.py:117: 
2025-05-07T20:32:12.1131482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.1131882Z moe/activation_test.py:115: in fn
2025-05-07T20:32:12.1132169Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:12.1132885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:12.1133592Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:12.1134149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:12.1134854Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:12.1135543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:12.1136093Z     kernel = self.compile(
2025-05-07T20:32:12.1136652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:12.1137452Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:12.1137863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.1145244Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.1145603Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.1145868Z E       ^
2025-05-07T20:32:12.1146350Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.1146858Z 
2025-05-07T20:32:12.1147289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.1147827Z 
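The Hypothesis loop adds a lot of noise around what is a deterministic compile-time failure. A standalone repro sketch, with the module path taken from the traceback above and the (x0, x1, scale_ub) call shape from the test body:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On a pre-sm_89 GPU this is expected to raise the same CompilationError.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)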
2025-05-07T20:32:12.1147938Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1148364Z     self=,
2025-05-07T20:32:12.1148780Z     T=128,
2025-05-07T20:32:12.1148967Z     D=5120,
2025-05-07T20:32:12.1149163Z     scale_ub=None,
2025-05-07T20:32:12.1149386Z     contiguous=False,
2025-05-07T20:32:12.1149616Z     compiled=True,
2025-05-07T20:32:12.1149830Z )
2025-05-07T20:32:12.1150159Z self = 
2025-05-07T20:32:12.1150662Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:12.1184401Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.1184767Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.1185030Z E       ^
2025-05-07T20:32:12.1185515Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.1186040Z 
2025-05-07T20:32:12.1186473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.1187012Z 
2025-05-07T20:32:12.1187122Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1187550Z     self=,
2025-05-07T20:32:12.1187965Z     T=128,
2025-05-07T20:32:12.1188163Z     D=7168,
2025-05-07T20:32:12.1188365Z     scale_ub=1200.0,
2025-05-07T20:32:12.1188596Z     contiguous=False,
2025-05-07T20:32:12.1188828Z     compiled=False,
2025-05-07T20:32:12.1189036Z )
2025-05-07T20:32:12.2324217Z self = 
2025-05-07T20:32:12.2324988Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:12.2353172Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.2353547Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.2353819Z E       ^
2025-05-07T20:32:12.2354300Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.2354777Z 
2025-05-07T20:32:12.2355209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.2355757Z 
2025-05-07T20:32:12.2355865Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.2356302Z     self=,
2025-05-07T20:32:12.2356717Z     T=128,
2025-05-07T20:32:12.2356916Z     D=5120,
2025-05-07T20:32:12.2357122Z     scale_ub=None,
2025-05-07T20:32:12.2357341Z     contiguous=False,
2025-05-07T20:32:12.2357579Z     compiled=False,
2025-05-07T20:32:12.2357802Z )
2025-05-07T20:32:12.2358127Z self = 
2025-05-07T20:32:12.2358674Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:12.2385586Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.2385962Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.2386227Z E       ^
2025-05-07T20:32:12.2386719Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.2387204Z 
2025-05-07T20:32:12.2387637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.2388172Z 
2025-05-07T20:32:12.2388287Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.2388749Z     self=,
2025-05-07T20:32:12.2389175Z     T=128,
2025-05-07T20:32:12.2389430Z     D=5120,
2025-05-07T20:32:12.2389629Z     scale_ub=1200.0,
2025-05-07T20:32:12.2389866Z     contiguous=True,
2025-05-07T20:32:12.2390095Z     compiled=False,
2025-05-07T20:32:12.2390312Z )
2025-05-07T20:32:12.4115396Z self = 
2025-05-07T20:32:12.4116526Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:12.4143943Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4144314Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4144596Z E       ^
2025-05-07T20:32:12.4145076Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4145551Z 
2025-05-07T20:32:12.4145985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4146533Z 
2025-05-07T20:32:12.4146644Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4147081Z     self=,
2025-05-07T20:32:12.4147500Z     T=1,
2025-05-07T20:32:12.4147706Z     D=7168,
2025-05-07T20:32:12.4147918Z     scale_ub=1200.0,
2025-05-07T20:32:12.4148150Z     contiguous=True,
2025-05-07T20:32:12.4148389Z     compiled=True,
2025-05-07T20:32:12.4148617Z )
2025-05-07T20:32:12.4149033Z self = 
2025-05-07T20:32:12.4149549Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:12.4177659Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4178031Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4178300Z E       ^
2025-05-07T20:32:12.4178790Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4179303Z 
2025-05-07T20:32:12.4179753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4180286Z 
2025-05-07T20:32:12.4180404Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4180834Z     self=,
2025-05-07T20:32:12.4181291Z     T=1,
2025-05-07T20:32:12.4181494Z     D=7168,
2025-05-07T20:32:12.4181698Z     scale_ub=1200.0,
2025-05-07T20:32:12.4181942Z     contiguous=False,
2025-05-07T20:32:12.4182189Z     compiled=True,
2025-05-07T20:32:12.4193565Z )
2025-05-07T20:32:12.5486234Z self = 
2025-05-07T20:32:12.5486990Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:12.5516048Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.5516416Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.5516696Z E       ^
2025-05-07T20:32:12.5517188Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.5517661Z 
2025-05-07T20:32:12.5518104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.5518639Z 
2025-05-07T20:32:12.5518750Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.5519190Z     self=,
2025-05-07T20:32:12.5519627Z     T=1,
2025-05-07T20:32:12.5519823Z     D=7168,
2025-05-07T20:32:12.5520034Z     scale_ub=None,
2025-05-07T20:32:12.5520270Z     contiguous=False,
2025-05-07T20:32:12.5520513Z     compiled=True,
2025-05-07T20:32:12.5520738Z )
2025-05-07T20:32:12.8174559Z self = 
2025-05-07T20:32:12.8175315Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:12.8190957Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:12.8191168Z 
2025-05-07T20:32:12.8191274Z moe/activation_test.py:126: 
2025-05-07T20:32:12.8191588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.8191945Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:12.8192285Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:12.8193118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:12.8193910Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:12.8211034Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8211535Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:12.8211885Z E       ^
2025-05-07T20:32:12.8212369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.8212845Z 
2025-05-07T20:32:12.8213282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.8213826Z 
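Both kernels fail for the same dtype reason, so a single capability check could turn this whole class of failures into skips rather than errors. A sketch follows; "ActivationTests" is a placeholder name, since this log strips the actual class repr:

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # Assumption: fp8e4nv lowering needs compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _has_fp8e4nv(), "Triton fp8e4nv assumed to require sm_89+")
    class ActivationTests(unittest.TestCase):  # placeholder class name
        pass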
2025-05-07T20:32:12.8213934Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.8214371Z     self=,
2025-05-07T20:32:12.8214788Z     T=1,
2025-05-07T20:32:12.8214981Z     D=5120,
2025-05-07T20:32:12.8215186Z     scale_ub=1200.0,
2025-05-07T20:32:12.8215424Z     contiguous=False,
2025-05-07T20:32:12.8215660Z     compiled=True,
2025-05-07T20:32:12.8215879Z )
2025-05-07T20:32:12.9730030Z self = 
2025-05-07T20:32:12.9730661Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:12.9758792Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.9759153Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.9759417Z E       ^
2025-05-07T20:32:12.9759888Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.9760359Z 
2025-05-07T20:32:12.9760786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.9761316Z 
2025-05-07T20:32:12.9761432Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.9761850Z     self=,
2025-05-07T20:32:12.9762263Z     T=1,
2025-05-07T20:32:12.9762447Z     D=5120,
2025-05-07T20:32:12.9762646Z     scale_ub=1200.0,
2025-05-07T20:32:12.9762869Z     contiguous=False,
2025-05-07T20:32:12.9763098Z     compiled=False,
2025-05-07T20:32:12.9763310Z )
2025-05-07T20:32:12.9763633Z self = 
2025-05-07T20:32:12.9764139Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:12.9790423Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.9790786Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.9791048Z E       ^
2025-05-07T20:32:12.9791522Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.9791991Z 
2025-05-07T20:32:12.9792420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.9792956Z 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.9792420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
(The following ten Hypothesis examples each failed at Triton compile time with the identical CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the repeated test source and tracebacks are omitted. The final example and its traceback are retained below.)
2025-05-07T20:32:12.9793059Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:13.0703193Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:13.1876843Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:13.1909511Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:13.1941170Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:13.3738165Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:13.5223686Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:13.5256577Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:13.6427701Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:13.6460592Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
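Every example above fails at the same point: Triton's make_ir rejects the fp8e4nv element type while lowering _fbgemm_silu_mul_quant. fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability (8, 9) (Ada/Hopper) onward; an older part such as an A10G reports (8, 6) and exposes only the 'fp8e4b15' and 'fp8e5' encodings named in the error, which matches this log. Below is a minimal sketch of a capability guard that a test like test_silu_mul_quant could use to skip cleanly on such GPUs; the helper _cuda_supports_fp8e4nv and the demo test class are hypothetical, not part of activation_test.py:

import unittest

import torch

def _cuda_supports_fp8e4nv() -> bool:
    # Hypothetical guard: Triton's fp8e4nv (float8_e4m3fn) kernels need an
    # NVIDIA GPU with compute capability >= (8, 9); older architectures fail
    # at kernel-compile time exactly as seen in the tracebacks above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardDemo(unittest.TestCase):
    @unittest.skipIf(
        not _cuda_supports_fp8e4nv(),
        "fp8e4nv requires SM 8.9+ (Ada/Hopper); skipping on this GPU",
    )
    def test_would_run_fp8_kernel(self) -> None:
        # Placeholder body; in activation_test.py this is where the
        # silu_mul_quant call that crashed above would execute.
        self.assertTrue(_cuda_supports_fp8e4nv())

if __name__ == "__main__":
    unittest.main()

Gating on torch.cuda.get_device_capability() rather than catching the Triton CompilationError keeps the skip decision cheap and local to the device, and Hypothesis would then report the case as skipped instead of spending max_examples on guaranteed compile failures.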
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.6502196Z 2025-05-07T20:32:13.6502630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.6503172Z 2025-05-07T20:32:13.6503279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.6503759Z self=, 2025-05-07T20:32:13.6504172Z T=16384, 2025-05-07T20:32:13.6504375Z D=7168, 2025-05-07T20:32:13.6504576Z scale_ub=1200.0, 2025-05-07T20:32:13.6504807Z contiguous=False, 2025-05-07T20:32:13.6505044Z compiled=True, 2025-05-07T20:32:13.8889238Z ) 2025-05-07T20:32:13.8889810Z self = 2025-05-07T20:32:13.8890373Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:13.8890675Z 2025-05-07T20:32:13.8890755Z @given( 2025-05-07T20:32:13.8890994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8891344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8891652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8892082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8892423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8892713Z ) 2025-05-07T20:32:13.8893081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8893536Z def test_silu_mul_quant( 2025-05-07T20:32:13.8893779Z self, 2025-05-07T20:32:13.8893993Z T: int, 2025-05-07T20:32:13.8894199Z D: int, 2025-05-07T20:32:13.8894417Z scale_ub: Optional[float], 2025-05-07T20:32:13.8894695Z contiguous: bool, 2025-05-07T20:32:13.8894952Z compiled: bool, 2025-05-07T20:32:13.8895185Z ) -> None: 2025-05-07T20:32:13.8895411Z torch.manual_seed(2025) 2025-05-07T20:32:13.8895662Z 2025-05-07T20:32:13.8895946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8896292Z 2025-05-07T20:32:13.8896494Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8896797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8897118Z x = x_sign * x_clamp 2025-05-07T20:32:13.8897368Z x0 = x[:, :D] 2025-05-07T20:32:13.8897593Z x1 = x[:, D:] 2025-05-07T20:32:13.8897803Z 2025-05-07T20:32:13.8897999Z if contiguous: 2025-05-07T20:32:13.8898239Z x0 = x0.contiguous() 2025-05-07T20:32:13.8898871Z x1 = x1.contiguous() 2025-05-07T20:32:13.8899126Z 2025-05-07T20:32:13.8899332Z if scale_ub is not None: 2025-05-07T20:32:13.8899607Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8899953Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8900274Z ) 2025-05-07T20:32:13.8900467Z else: 2025-05-07T20:32:13.8900692Z scale_ub_tensor = None 2025-05-07T20:32:13.8900957Z 2025-05-07T20:32:13.8901197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8901518Z op = silu_mul_quant 2025-05-07T20:32:13.8901779Z if compiled: 2025-05-07T20:32:13.8902035Z op = torch.compile(op) 2025-05-07T20:32:13.8902334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8902624Z 2025-05-07T20:32:13.8902830Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8902999Z 2025-05-07T20:32:13.8903101Z moe/activation_test.py:117: 2025-05-07T20:32:13.8903422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8903772Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8904058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8904638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.8905222Z return fn(*args, **kwargs) 
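Every example Hypothesis tries in this test fails at the same point: Triton rejects the _fbgemm_silu_mul_quant kernel at compile time because the fp8e4nv dtype (FP8 E4M3) is not implemented for the GPU this runner exposes, which only offers fp8e4b15 and fp8e5. In Triton's NVIDIA backend that check keys off the CUDA compute capability; fp8e4nv generally requires SM 8.9 or newer (Ada/Hopper), while Ampere-class parts such as the A10G report SM 8.6. The remaining examples, consolidated below, hit the identical error; the compiled=True cases merely add a torch/_dynamo/eval_frame.py frame before reaching the same kernel launch. A minimal capability probe, as a sketch only (plain PyTorch, not FBGEMM or test code; the SM 8.9 threshold is an assumption about this Triton build):

import torch

def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (FP8 E4M3) needs compute capability 8.9+;
    # on older NVIDIA parts Triton raises the CompilationError shown above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
print("fp8e4nv supported:", supports_fp8e4nv())

On a runner like this one, the probe would print a capability below (8, 9) and False, matching the compile-time rejection in the traceback above.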
2025-05-07T20:32:13.8905993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8907084Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8907642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8908453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8909141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8909692Z kernel = self.compile( 2025-05-07T20:32:13.8910251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8910933Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8911338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8911585Z 2025-05-07T20:32:13.8911798Z self = 2025-05-07T20:32:13.8912921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8914369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f963914fb00>} 2025-05-07T20:32:13.8915767Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8916824Z context = 2025-05-07T20:32:13.8917127Z 2025-05-07T20:32:13.8917301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8917875Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8918360Z module_map=module_map) 2025-05-07T20:32:13.8918737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8919098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8919370Z E ^ 2025-05-07T20:32:13.8919851Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8920442Z 2025-05-07T20:32:13.8920874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8921410Z 2025-05-07T20:32:13.8921518Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8921948Z self=, 2025-05-07T20:32:13.8922366Z T=1, 2025-05-07T20:32:13.8922557Z D=7168, 2025-05-07T20:32:13.8922759Z scale_ub=None, 2025-05-07T20:32:13.8922983Z contiguous=False, 2025-05-07T20:32:13.8923215Z compiled=False, 2025-05-07T20:32:13.8923429Z ) 2025-05-07T20:32:13.8923763Z self = 2025-05-07T20:32:13.8924269Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.8924547Z 2025-05-07T20:32:13.8924628Z @given( 2025-05-07T20:32:13.8924871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8925197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8925517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8925860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8926204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8926498Z ) 2025-05-07T20:32:13.8926864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8927393Z def test_silu_mul_quant( 2025-05-07T20:32:13.8927639Z self, 2025-05-07T20:32:13.8927846Z T: int, 2025-05-07T20:32:13.8928055Z D: int, 2025-05-07T20:32:13.8928277Z scale_ub: Optional[float], 2025-05-07T20:32:13.8928565Z contiguous: bool, 2025-05-07T20:32:13.8928868Z compiled: bool, 2025-05-07T20:32:13.8929096Z ) -> None: 2025-05-07T20:32:13.8929322Z torch.manual_seed(2025) 2025-05-07T20:32:13.8929575Z 2025-05-07T20:32:13.8929859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8930216Z 2025-05-07T20:32:13.8930417Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8930712Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8931032Z x = x_sign * x_clamp 2025-05-07T20:32:13.8931280Z x0 = x[:, :D] 2025-05-07T20:32:13.8931522Z x1 = x[:, D:] 2025-05-07T20:32:13.8931731Z 2025-05-07T20:32:13.8932019Z if contiguous: 2025-05-07T20:32:13.8932260Z x0 = x0.contiguous() 2025-05-07T20:32:13.8932523Z x1 = x1.contiguous() 2025-05-07T20:32:13.8932776Z 2025-05-07T20:32:13.8932976Z if scale_ub is not None: 2025-05-07T20:32:13.8933259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8933601Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8933927Z ) 2025-05-07T20:32:13.8934130Z else: 2025-05-07T20:32:13.8934344Z scale_ub_tensor = None 2025-05-07T20:32:13.8934606Z 2025-05-07T20:32:13.8934852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8935173Z op = silu_mul_quant 2025-05-07T20:32:13.8935434Z if compiled: 2025-05-07T20:32:13.8935690Z op = torch.compile(op) 2025-05-07T20:32:13.8935994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8936283Z 2025-05-07T20:32:13.8936482Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8936652Z 2025-05-07T20:32:13.8936755Z moe/activation_test.py:117: 2025-05-07T20:32:13.8937062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8937412Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8937702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8938408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8939121Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8939795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8940496Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8941187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8941740Z kernel = self.compile( 2025-05-07T20:32:13.8942303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8942981Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8943394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8943633Z 2025-05-07T20:32:13.8943852Z self = 2025-05-07T20:32:13.8944979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8946440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ce1200e0>} 2025-05-07T20:32:13.8947842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8948948Z context = 2025-05-07T20:32:13.8949247Z 2025-05-07T20:32:13.8949424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8950001Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8950486Z module_map=module_map) 2025-05-07T20:32:13.8950869Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8951237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8951501Z E ^ 2025-05-07T20:32:13.8951984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8952449Z 2025-05-07T20:32:13.8952890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8953425Z 2025-05-07T20:32:13.8953539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8953963Z self=, 2025-05-07T20:32:13.8954385Z T=2048, 2025-05-07T20:32:13.8954586Z D=7168, 2025-05-07T20:32:13.8954779Z scale_ub=None, 2025-05-07T20:32:13.8955007Z contiguous=False, 2025-05-07T20:32:13.8955243Z compiled=True, 2025-05-07T20:32:13.8955447Z ) 2025-05-07T20:32:13.9822335Z self = 2025-05-07T20:32:13.9823123Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.9823411Z 2025-05-07T20:32:13.9823491Z @given( 2025-05-07T20:32:13.9823732Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.9824054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.9824366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.9824708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.9825044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.9825338Z ) 2025-05-07T20:32:13.9825692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.9826157Z def test_silu_mul_quant( 2025-05-07T20:32:13.9826408Z self, 2025-05-07T20:32:13.9826605Z T: int, 2025-05-07T20:32:13.9826810Z D: int, 2025-05-07T20:32:13.9827038Z scale_ub: Optional[float], 2025-05-07T20:32:13.9827617Z contiguous: bool, 2025-05-07T20:32:13.9827872Z compiled: bool, 2025-05-07T20:32:13.9828109Z ) -> None: 2025-05-07T20:32:13.9828328Z torch.manual_seed(2025) 2025-05-07T20:32:13.9828580Z 2025-05-07T20:32:13.9828864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9829214Z 2025-05-07T20:32:13.9829421Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9829725Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9830046Z x = x_sign * x_clamp 2025-05-07T20:32:13.9830299Z x0 = x[:, :D] 2025-05-07T20:32:13.9830527Z x1 = x[:, D:] 2025-05-07T20:32:13.9830740Z 2025-05-07T20:32:13.9830934Z if contiguous: 2025-05-07T20:32:13.9831181Z x0 = x0.contiguous() 2025-05-07T20:32:13.9831457Z x1 = x1.contiguous() 2025-05-07T20:32:13.9831700Z 2025-05-07T20:32:13.9831898Z if scale_ub is not None: 2025-05-07T20:32:13.9832187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9832528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9832854Z ) 2025-05-07T20:32:13.9833058Z else: 2025-05-07T20:32:13.9833273Z scale_ub_tensor = None 2025-05-07T20:32:13.9833536Z 2025-05-07T20:32:13.9833779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9834184Z op = silu_mul_quant 2025-05-07T20:32:13.9834443Z if compiled: 2025-05-07T20:32:13.9834700Z op = torch.compile(op) 2025-05-07T20:32:13.9834996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9835281Z 2025-05-07T20:32:13.9835483Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.9835731Z 2025-05-07T20:32:13.9835839Z moe/activation_test.py:117: 2025-05-07T20:32:13.9836140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9836486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.9836781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9837358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.9837943Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.9838627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.9839342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.9839893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9840607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9841301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9841856Z kernel = self.compile( 2025-05-07T20:32:13.9842421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9843104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9843518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9843755Z 2025-05-07T20:32:13.9843969Z self = 2025-05-07T20:32:13.9845093Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9846533Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a56ac0>} 2025-05-07T20:32:13.9848038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9849103Z context = 2025-05-07T20:32:13.9849408Z 2025-05-07T20:32:13.9849582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9850128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9850617Z module_map=module_map) 2025-05-07T20:32:13.9850989Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9851359Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.9851631Z E ^ 2025-05-07T20:32:13.9852184Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9852660Z 2025-05-07T20:32:13.9853098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9853642Z 2025-05-07T20:32:13.9853749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.9854185Z self=, 2025-05-07T20:32:13.9854604Z T=4096, 2025-05-07T20:32:13.9854805Z D=7168, 2025-05-07T20:32:13.9855009Z scale_ub=None, 2025-05-07T20:32:13.9855229Z contiguous=False, 2025-05-07T20:32:13.9855522Z compiled=True, 2025-05-07T20:32:13.9855738Z ) 2025-05-07T20:32:13.9856071Z self = 2025-05-07T20:32:13.9856593Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.9856883Z 2025-05-07T20:32:13.9856964Z @given( 2025-05-07T20:32:13.9857246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.9857566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.9857885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.9858232Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.9858567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.9858863Z ) 2025-05-07T20:32:13.9859225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.9859686Z def test_silu_mul_quant( 2025-05-07T20:32:13.9859931Z self, 2025-05-07T20:32:13.9860136Z T: int, 2025-05-07T20:32:13.9860343Z D: int, 2025-05-07T20:32:13.9860564Z scale_ub: Optional[float], 2025-05-07T20:32:13.9860845Z contiguous: bool, 2025-05-07T20:32:13.9861095Z compiled: bool, 2025-05-07T20:32:13.9861321Z ) -> None: 2025-05-07T20:32:13.9861546Z torch.manual_seed(2025) 2025-05-07T20:32:13.9861801Z 2025-05-07T20:32:13.9862078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9862433Z 2025-05-07T20:32:13.9862634Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9862935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9863259Z x = x_sign * x_clamp 2025-05-07T20:32:13.9863508Z x0 = x[:, :D] 2025-05-07T20:32:13.9863728Z x1 = x[:, D:] 2025-05-07T20:32:13.9863943Z 2025-05-07T20:32:13.9864138Z if contiguous: 2025-05-07T20:32:13.9864373Z x0 = x0.contiguous() 2025-05-07T20:32:13.9864640Z x1 = x1.contiguous() 2025-05-07T20:32:13.9864892Z 2025-05-07T20:32:13.9865086Z if scale_ub is not None: 2025-05-07T20:32:13.9865371Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9865720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9866048Z ) 2025-05-07T20:32:13.9866246Z else: 2025-05-07T20:32:13.9866472Z scale_ub_tensor = None 2025-05-07T20:32:13.9866735Z 2025-05-07T20:32:13.9866969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9867318Z op = silu_mul_quant 2025-05-07T20:32:13.9867664Z if compiled: 2025-05-07T20:32:13.9867929Z op = torch.compile(op) 2025-05-07T20:32:13.9868241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9868520Z 2025-05-07T20:32:13.9868720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.9868888Z 2025-05-07T20:32:13.9869001Z moe/activation_test.py:117: 2025-05-07T20:32:13.9869306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9869658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.9869951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9870526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.9871112Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.9871797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.9872518Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.9873072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9873783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9874482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9875083Z kernel = self.compile( 2025-05-07T20:32:13.9875637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9876319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9886600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9887009Z 2025-05-07T20:32:13.9887228Z self = 2025-05-07T20:32:13.9888362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9889789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a55b20>} 2025-05-07T20:32:13.9891185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9892338Z context = 2025-05-07T20:32:13.9892636Z 2025-05-07T20:32:13.9892805Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9893343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9893834Z module_map=module_map) 2025-05-07T20:32:13.9894216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9894581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.9894853Z E ^ 2025-05-07T20:32:13.9895339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9895810Z 2025-05-07T20:32:13.9896240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9896783Z 2025-05-07T20:32:14.1469443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1470097Z self=, 2025-05-07T20:32:14.1470540Z T=16384, 2025-05-07T20:32:14.1470743Z D=5120, 2025-05-07T20:32:14.1470948Z scale_ub=1200.0, 2025-05-07T20:32:14.1471172Z contiguous=False, 2025-05-07T20:32:14.1471407Z compiled=False, 2025-05-07T20:32:14.1471984Z ) 2025-05-07T20:32:14.1472318Z self = 2025-05-07T20:32:14.1472843Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.1473134Z 2025-05-07T20:32:14.1473219Z @given( 2025-05-07T20:32:14.1473451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1473782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1474102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1474441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1474772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1475065Z ) 2025-05-07T20:32:14.1475425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1475882Z def test_silu_mul_quant( 2025-05-07T20:32:14.1476136Z self, 2025-05-07T20:32:14.1476341Z T: int, 2025-05-07T20:32:14.1476551Z D: int, 2025-05-07T20:32:14.1476778Z scale_ub: Optional[float], 2025-05-07T20:32:14.1477059Z contiguous: bool, 2025-05-07T20:32:14.1477304Z compiled: bool, 2025-05-07T20:32:14.1477541Z ) -> None: 2025-05-07T20:32:14.1477763Z torch.manual_seed(2025) 2025-05-07T20:32:14.1478005Z 2025-05-07T20:32:14.1478283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1478719Z 2025-05-07T20:32:14.1478914Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1479214Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1479533Z x = x_sign * x_clamp 2025-05-07T20:32:14.1479784Z x0 = x[:, :D] 2025-05-07T20:32:14.1480003Z x1 = x[:, D:] 2025-05-07T20:32:14.1480299Z 2025-05-07T20:32:14.1480494Z if contiguous: 2025-05-07T20:32:14.1480730Z x0 = x0.contiguous() 2025-05-07T20:32:14.1480998Z x1 = x1.contiguous() 2025-05-07T20:32:14.1481252Z 2025-05-07T20:32:14.1481450Z if scale_ub is not None: 2025-05-07T20:32:14.1481732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1482080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1482392Z ) 2025-05-07T20:32:14.1482592Z else: 2025-05-07T20:32:14.1482812Z scale_ub_tensor = None 2025-05-07T20:32:14.1483067Z 2025-05-07T20:32:14.1483312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1483639Z op = silu_mul_quant 2025-05-07T20:32:14.1483896Z if compiled: 2025-05-07T20:32:14.1484156Z op = torch.compile(op) 2025-05-07T20:32:14.1484467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1484760Z 2025-05-07T20:32:14.1484955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1485129Z 2025-05-07T20:32:14.1485235Z moe/activation_test.py:117: 2025-05-07T20:32:14.1485543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1485887Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1486177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1486891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.1487608Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1488158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1488874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1489564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1490115Z kernel = self.compile( 2025-05-07T20:32:14.1490676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1491448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1491969Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1492209Z 2025-05-07T20:32:14.1492424Z self = 2025-05-07T20:32:14.1493547Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1494993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a54c20>} 2025-05-07T20:32:14.1496391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1497457Z context = 2025-05-07T20:32:14.1497761Z 2025-05-07T20:32:14.1497932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1498474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1498961Z module_map=module_map) 2025-05-07T20:32:14.1499385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1499757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1500027Z E ^ 2025-05-07T20:32:14.1500503Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1501021Z 2025-05-07T20:32:14.1501454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1501994Z 2025-05-07T20:32:14.1502101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1502535Z self=, 2025-05-07T20:32:14.1502949Z T=16384, 2025-05-07T20:32:14.1503152Z D=5120, 2025-05-07T20:32:14.1503360Z scale_ub=1200.0, 2025-05-07T20:32:14.1503587Z contiguous=True, 2025-05-07T20:32:14.1503818Z compiled=True, 2025-05-07T20:32:14.1504033Z ) 2025-05-07T20:32:14.1504359Z self = 2025-05-07T20:32:14.1504879Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.1505171Z 2025-05-07T20:32:14.1505253Z @given( 2025-05-07T20:32:14.1505495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1505818Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1506405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1506750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507392Z ) 2025-05-07T20:32:14.1507758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1508220Z def test_silu_mul_quant( 2025-05-07T20:32:14.1508465Z self, 2025-05-07T20:32:14.1508673Z T: int, 2025-05-07T20:32:14.1508880Z D: int, 2025-05-07T20:32:14.1509101Z scale_ub: Optional[float], 2025-05-07T20:32:14.1509382Z contiguous: bool, 2025-05-07T20:32:14.1509631Z compiled: bool, 2025-05-07T20:32:14.1509856Z ) -> None: 2025-05-07T20:32:14.1510078Z torch.manual_seed(2025) 2025-05-07T20:32:14.1510332Z 2025-05-07T20:32:14.1510606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1510962Z 2025-05-07T20:32:14.1511163Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1511457Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1511779Z x = x_sign * x_clamp 2025-05-07T20:32:14.1512157Z x0 = x[:, :D] 2025-05-07T20:32:14.1512381Z x1 = x[:, D:] 2025-05-07T20:32:14.1512599Z 2025-05-07T20:32:14.1512795Z if contiguous: 2025-05-07T20:32:14.1513033Z x0 = x0.contiguous() 2025-05-07T20:32:14.1513300Z x1 = x1.contiguous() 2025-05-07T20:32:14.1513547Z 2025-05-07T20:32:14.1513740Z if scale_ub is not None: 2025-05-07T20:32:14.1514025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1514370Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1514687Z ) 2025-05-07T20:32:14.1514883Z else: 2025-05-07T20:32:14.1515102Z scale_ub_tensor = None 2025-05-07T20:32:14.1515360Z 2025-05-07T20:32:14.1515599Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1515928Z op = silu_mul_quant 2025-05-07T20:32:14.1516189Z if compiled: 2025-05-07T20:32:14.1516438Z op = torch.compile(op) 2025-05-07T20:32:14.1516753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1517042Z 2025-05-07T20:32:14.1517239Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1517413Z 2025-05-07T20:32:14.1517516Z moe/activation_test.py:117: 2025-05-07T20:32:14.1517823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1518168Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1518540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1519126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.1519707Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.1520385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1521160Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1521724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1522433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1523124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1523677Z kernel = self.compile( 2025-05-07T20:32:14.1524242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1524923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1525343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1525587Z 2025-05-07T20:32:14.1525806Z self = 2025-05-07T20:32:14.1526941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1528362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96385af060>} 2025-05-07T20:32:14.1529749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1530814Z context = 2025-05-07T20:32:14.1531133Z 2025-05-07T20:32:14.1531306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1531928Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1532416Z module_map=module_map) 2025-05-07T20:32:14.1532901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1533271Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1533544Z E ^ 2025-05-07T20:32:14.1534021Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1534495Z 2025-05-07T20:32:14.1534929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1535473Z 2025-05-07T20:32:14.3232577Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.3233711Z self=, 2025-05-07T20:32:14.3234537Z T=16384, 2025-05-07T20:32:14.3234915Z D=5120, 2025-05-07T20:32:14.3235324Z scale_ub=None, 2025-05-07T20:32:14.3235748Z contiguous=False, 2025-05-07T20:32:14.3236189Z compiled=True, 2025-05-07T20:32:14.3236543Z ) 2025-05-07T20:32:14.3236877Z self = 2025-05-07T20:32:14.3237405Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.3237704Z 2025-05-07T20:32:14.3237785Z @given( 2025-05-07T20:32:14.3238021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.3238347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.3238656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.3239249Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.3239588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.3239876Z ) 2025-05-07T20:32:14.3240238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.3240694Z def test_silu_mul_quant( 2025-05-07T20:32:14.3241016Z self, 2025-05-07T20:32:14.3241219Z T: int, 2025-05-07T20:32:14.3241421Z D: int, 2025-05-07T20:32:14.3241637Z scale_ub: Optional[float], 2025-05-07T20:32:14.3241922Z contiguous: bool, 2025-05-07T20:32:14.3242179Z compiled: bool, 2025-05-07T20:32:14.3242411Z ) -> None: 2025-05-07T20:32:14.3242625Z torch.manual_seed(2025) 2025-05-07T20:32:14.3242871Z 2025-05-07T20:32:14.3243149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.3243496Z 2025-05-07T20:32:14.3243694Z x_sign = torch.sign(x) 2025-05-07T20:32:14.3243995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.3244313Z x = x_sign * x_clamp 2025-05-07T20:32:14.3244564Z x0 = x[:, :D] 2025-05-07T20:32:14.3244788Z x1 = x[:, D:] 2025-05-07T20:32:14.3244997Z 2025-05-07T20:32:14.3245189Z if contiguous: 2025-05-07T20:32:14.3245427Z x0 = x0.contiguous() 2025-05-07T20:32:14.3245691Z x1 = x1.contiguous() 2025-05-07T20:32:14.3245941Z 2025-05-07T20:32:14.3246138Z if scale_ub is not None: 2025-05-07T20:32:14.3246411Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.3246767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.3247086Z ) 2025-05-07T20:32:14.3247279Z else: 2025-05-07T20:32:14.3247500Z scale_ub_tensor = None 2025-05-07T20:32:14.3247761Z 2025-05-07T20:32:14.3248002Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.3248324Z op = silu_mul_quant 2025-05-07T20:32:14.3248587Z if compiled: 2025-05-07T20:32:14.3248844Z op = torch.compile(op) 2025-05-07T20:32:14.3249144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.3249427Z 2025-05-07T20:32:14.3249627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.3249796Z 2025-05-07T20:32:14.3249897Z moe/activation_test.py:117: 2025-05-07T20:32:14.3250208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.3250555Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.3250989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.3251575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.3252254Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.3252935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.3253648Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.3254208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.3254917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.3255606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.3256156Z kernel = self.compile( 2025-05-07T20:32:14.3256725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.3257412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.3257819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.3258068Z 2025-05-07T20:32:14.3258280Z self = 2025-05-07T20:32:14.3259405Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.3260900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9638200b80>} 2025-05-07T20:32:14.3262344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.3263403Z context = 2025-05-07T20:32:14.3263711Z 2025-05-07T20:32:14.3263883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.3264428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.3264918Z module_map=module_map) 2025-05-07T20:32:14.3265293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.3265661Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.3265931Z E ^ 2025-05-07T20:32:14.3266409Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.3266883Z 2025-05-07T20:32:14.3267320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.3267869Z 2025-05-07T20:32:14.3267976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.3268410Z self=, 2025-05-07T20:32:14.3268829Z T=2048, 2025-05-07T20:32:14.3269029Z D=5120, 2025-05-07T20:32:14.3269230Z scale_ub=None, 2025-05-07T20:32:14.3269450Z contiguous=False, 2025-05-07T20:32:14.3269685Z compiled=True, 2025-05-07T20:32:14.3269901Z ) 2025-05-07T20:32:14.4177077Z self = 2025-05-07T20:32:14.4177755Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.4178040Z 2025-05-07T20:32:14.4178119Z @given( 2025-05-07T20:32:14.4178350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.4178694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.4179000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.4179677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.4180015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.4180298Z ) 2025-05-07T20:32:14.4180648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.4181099Z def test_silu_mul_quant( 2025-05-07T20:32:14.4181350Z self, 2025-05-07T20:32:14.4181542Z T: int, 2025-05-07T20:32:14.4181750Z D: int, 2025-05-07T20:32:14.4181970Z scale_ub: Optional[float], 2025-05-07T20:32:14.4182239Z contiguous: bool, 2025-05-07T20:32:14.4182485Z compiled: bool, 2025-05-07T20:32:14.4182719Z ) -> None: 2025-05-07T20:32:14.4182934Z torch.manual_seed(2025) 2025-05-07T20:32:14.4183177Z 2025-05-07T20:32:14.4183457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.4183801Z 2025-05-07T20:32:14.4183995Z x_sign = torch.sign(x) 2025-05-07T20:32:14.4184292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.4184614Z x = x_sign * x_clamp 2025-05-07T20:32:14.4184853Z x0 = x[:, :D] 2025-05-07T20:32:14.4185073Z x1 = x[:, D:] 2025-05-07T20:32:14.4185283Z 2025-05-07T20:32:14.4185467Z if contiguous: 2025-05-07T20:32:14.4185700Z x0 = x0.contiguous() 2025-05-07T20:32:14.4185965Z x1 = x1.contiguous() 2025-05-07T20:32:14.4186200Z 2025-05-07T20:32:14.4186475Z if scale_ub is not None: 2025-05-07T20:32:14.4186752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.4187089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.4187404Z ) 2025-05-07T20:32:14.4187601Z else: 2025-05-07T20:32:14.4187810Z scale_ub_tensor = None 2025-05-07T20:32:14.4188148Z 2025-05-07T20:32:14.4188385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.4188700Z op = silu_mul_quant 2025-05-07T20:32:14.4188960Z if compiled: 2025-05-07T20:32:14.4189218Z op = torch.compile(op) 2025-05-07T20:32:14.4189525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4189800Z 2025-05-07T20:32:14.4189998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.4190164Z 2025-05-07T20:32:14.4190270Z moe/activation_test.py:117: 2025-05-07T20:32:14.4190572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4190920Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.4191213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4191782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.4192362Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.4193054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.4193762Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.4194317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.4195028Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.4195721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.4196266Z kernel = self.compile( 2025-05-07T20:32:14.4196832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.4197511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.4197920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4198156Z 2025-05-07T20:32:14.4198367Z self = 2025-05-07T20:32:14.4199572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.4201007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382020c0>} 2025-05-07T20:32:14.4202402Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.4203467Z context = 2025-05-07T20:32:14.4203761Z 2025-05-07T20:32:14.4203932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.4204471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.4204953Z module_map=module_map) 2025-05-07T20:32:14.4205323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.4205686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.4205952Z E ^ 2025-05-07T20:32:14.4206795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.4207263Z 2025-05-07T20:32:14.4207694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.4208300Z 2025-05-07T20:32:14.4208406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.4208832Z self=, 2025-05-07T20:32:14.4209241Z T=2048, 2025-05-07T20:32:14.4209501Z D=5120, 2025-05-07T20:32:14.4209701Z scale_ub=1200.0, 2025-05-07T20:32:14.4209929Z contiguous=False, 2025-05-07T20:32:14.4210151Z compiled=True, 2025-05-07T20:32:14.4210359Z ) 2025-05-07T20:32:14.4210692Z self = 2025-05-07T20:32:14.4211201Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.4211490Z 2025-05-07T20:32:14.4211568Z @given( 2025-05-07T20:32:14.4211880Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.4212194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.4212510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.4212868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.4213202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.4213492Z ) 2025-05-07T20:32:14.4213854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.4214313Z def test_silu_mul_quant( 2025-05-07T20:32:14.4214554Z self, 2025-05-07T20:32:14.4214754Z T: int, 2025-05-07T20:32:14.4214958Z D: int, 2025-05-07T20:32:14.4215181Z scale_ub: Optional[float], 2025-05-07T20:32:14.4215461Z contiguous: bool, 2025-05-07T20:32:14.4215710Z compiled: bool, 2025-05-07T20:32:14.4215936Z ) -> None: 2025-05-07T20:32:14.4216159Z torch.manual_seed(2025) 2025-05-07T20:32:14.4216414Z 2025-05-07T20:32:14.4216692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.4217044Z 2025-05-07T20:32:14.4217248Z x_sign = torch.sign(x) 2025-05-07T20:32:14.4217540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.4217861Z x = x_sign * x_clamp 2025-05-07T20:32:14.4228675Z x0 = x[:, :D] 2025-05-07T20:32:14.4228976Z x1 = x[:, D:] 2025-05-07T20:32:14.4229190Z 2025-05-07T20:32:14.4229385Z if contiguous: 2025-05-07T20:32:14.4229626Z x0 = x0.contiguous() 2025-05-07T20:32:14.4229896Z x1 = x1.contiguous() 2025-05-07T20:32:14.4230134Z 2025-05-07T20:32:14.4230328Z if scale_ub is not None: 2025-05-07T20:32:14.4230788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.4231152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.4231465Z ) 2025-05-07T20:32:14.4231660Z else: 2025-05-07T20:32:14.4231863Z scale_ub_tensor = None 2025-05-07T20:32:14.4232112Z 2025-05-07T20:32:14.4232348Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.4232675Z op = silu_mul_quant 2025-05-07T20:32:14.4232927Z if compiled: 2025-05-07T20:32:14.4233188Z op = torch.compile(op) 2025-05-07T20:32:14.4233491Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4233769Z 2025-05-07T20:32:14.4233964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.4234134Z 2025-05-07T20:32:14.4234240Z moe/activation_test.py:117: 2025-05-07T20:32:14.4234544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4234883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.4235170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4235748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.4236320Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.4237048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.4237806Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.4238355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.4239052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.4239777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.4240325Z kernel = self.compile( 2025-05-07T20:32:14.4240882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.4241559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.4241967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4242201Z 2025-05-07T20:32:14.4242417Z self = 2025-05-07T20:32:14.4243533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.4244956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382032e0>} 2025-05-07T20:32:14.4246355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.4247416Z context = 2025-05-07T20:32:14.4247711Z 2025-05-07T20:32:14.4247886Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.4248419Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.4248905Z module_map=module_map) 2025-05-07T20:32:14.4249278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.4249637Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.4249904Z E ^ 2025-05-07T20:32:14.4250383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.4250849Z 2025-05-07T20:32:14.4251374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.4251979Z 2025-05-07T20:32:14.5989914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.5990895Z self=, 2025-05-07T20:32:14.5991715Z T=4096, 2025-05-07T20:32:14.5992079Z D=5120, 2025-05-07T20:32:14.5992459Z scale_ub=1200.0, 2025-05-07T20:32:14.5992922Z contiguous=True, 2025-05-07T20:32:14.5993361Z compiled=True, 2025-05-07T20:32:14.5993760Z ) 2025-05-07T20:32:14.5994409Z self = 2025-05-07T20:32:14.5995424Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.5995981Z 2025-05-07T20:32:14.5996145Z @given( 2025-05-07T20:32:14.5996461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5996780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5997088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5997452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5997789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5998084Z ) 2025-05-07T20:32:14.5998436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5998890Z def test_silu_mul_quant( 2025-05-07T20:32:14.5999134Z self, 2025-05-07T20:32:14.5999600Z T: int, 2025-05-07T20:32:14.5999792Z D: int, 2025-05-07T20:32:14.6000010Z scale_ub: Optional[float], 2025-05-07T20:32:14.6000286Z contiguous: bool, 2025-05-07T20:32:14.6000523Z compiled: bool, 2025-05-07T20:32:14.6000757Z ) -> None: 2025-05-07T20:32:14.6000977Z torch.manual_seed(2025) 2025-05-07T20:32:14.6001306Z 2025-05-07T20:32:14.6001583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6001930Z 2025-05-07T20:32:14.6002120Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6002421Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6002739Z x = x_sign * x_clamp 2025-05-07T20:32:14.6002979Z x0 = x[:, :D] 2025-05-07T20:32:14.6003199Z x1 = x[:, D:] 2025-05-07T20:32:14.6003414Z 2025-05-07T20:32:14.6003598Z if contiguous: 2025-05-07T20:32:14.6003833Z x0 = x0.contiguous() 2025-05-07T20:32:14.6004098Z x1 = x1.contiguous() 2025-05-07T20:32:14.6004338Z 2025-05-07T20:32:14.6004534Z if scale_ub is not None: 2025-05-07T20:32:14.6004813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6005158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6005466Z ) 2025-05-07T20:32:14.6005667Z else: 2025-05-07T20:32:14.6005879Z scale_ub_tensor = None 2025-05-07T20:32:14.6006129Z 2025-05-07T20:32:14.6006645Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6006969Z op = silu_mul_quant 2025-05-07T20:32:14.6007227Z if compiled: 2025-05-07T20:32:14.6007481Z op = torch.compile(op) 2025-05-07T20:32:14.6007783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6008058Z 2025-05-07T20:32:14.6008255Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6008420Z 2025-05-07T20:32:14.6008527Z moe/activation_test.py:117: 2025-05-07T20:32:14.6008832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6009173Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6009463Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6010041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6010616Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6011295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6012257Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6012807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6013512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6014201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6014753Z kernel = self.compile( 2025-05-07T20:32:14.6015310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6015994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6016405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6016642Z 2025-05-07T20:32:14.6016860Z self = 2025-05-07T20:32:14.6017980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6019417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7e7c860>} 2025-05-07T20:32:14.6020877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6021940Z context = 2025-05-07T20:32:14.6022306Z 2025-05-07T20:32:14.6022482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6023015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6023498Z module_map=module_map) 2025-05-07T20:32:14.6023872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6024227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6024494Z E ^ 2025-05-07T20:32:14.6024971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6025438Z 2025-05-07T20:32:14.6025872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6026401Z 2025-05-07T20:32:14.6026508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6026933Z self=, 2025-05-07T20:32:14.6027349Z T=128, 2025-05-07T20:32:14.6027537Z D=5120, 2025-05-07T20:32:14.6027732Z scale_ub=1200.0, 2025-05-07T20:32:14.6027961Z contiguous=False, 2025-05-07T20:32:14.6028184Z compiled=True, 2025-05-07T20:32:14.6028398Z ) 2025-05-07T20:32:14.8815578Z self = 2025-05-07T20:32:14.8816197Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.8816478Z 2025-05-07T20:32:14.8816561Z @given( 2025-05-07T20:32:14.8816826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8817171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8817481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8817816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8818143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8818433Z ) 2025-05-07T20:32:14.8818797Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8819246Z def test_silu_mul_quant( 2025-05-07T20:32:14.8819497Z self, 2025-05-07T20:32:14.8819699Z T: int, 2025-05-07T20:32:14.8820279Z D: int, 2025-05-07T20:32:14.8820511Z scale_ub: Optional[float], 2025-05-07T20:32:14.8820789Z contiguous: bool, 2025-05-07T20:32:14.8821035Z compiled: bool, 2025-05-07T20:32:14.8821258Z ) -> None: 2025-05-07T20:32:14.8821477Z torch.manual_seed(2025) 2025-05-07T20:32:14.8821724Z 2025-05-07T20:32:14.8822001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8822356Z 2025-05-07T20:32:14.8822559Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8822851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8823170Z x = x_sign * x_clamp 2025-05-07T20:32:14.8823418Z x0 = x[:, :D] 2025-05-07T20:32:14.8823633Z x1 = x[:, D:] 2025-05-07T20:32:14.8823856Z 2025-05-07T20:32:14.8824049Z if contiguous: 2025-05-07T20:32:14.8824280Z x0 = x0.contiguous() 2025-05-07T20:32:14.8824544Z x1 = x1.contiguous() 2025-05-07T20:32:14.8824792Z 2025-05-07T20:32:14.8824986Z if scale_ub is not None: 2025-05-07T20:32:14.8825265Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8825608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8825923Z ) 2025-05-07T20:32:14.8826117Z else: 2025-05-07T20:32:14.8826334Z scale_ub_tensor = None 2025-05-07T20:32:14.8826591Z 2025-05-07T20:32:14.8826937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8827258Z op = silu_mul_quant 2025-05-07T20:32:14.8827517Z if compiled: 2025-05-07T20:32:14.8827763Z op = torch.compile(op) 2025-05-07T20:32:14.8828069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8828435Z 2025-05-07T20:32:14.8828631Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.8828807Z 2025-05-07T20:32:14.8828909Z moe/activation_test.py:117: 2025-05-07T20:32:14.8829222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8829557Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.8829846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8830421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.8830996Z return fn(*args, **kwargs) 
[The CompilationError traceback above repeats verbatim for this example and for each of the following Hypothesis examples — every one fails in make_ir with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100:]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
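The failure mode above is architectural, not numerical: the job ran on a g5.4xlarge runner whose NVIDIA A10G GPU is compute capability 8.6 (sm_86), and on that part Triton exposes only the fp8e4b15 and fp8e5 variants named in the error, not fp8e4nv (FP8 E4M3). Below is a minimal sketch of a capability gate a test suite could use to skip these examples on unsupported GPUs; the helper names and the (8, 9) threshold are assumptions inferred from the error above, not taken from this log or from FBGEMM's actual test configuration.

    import pytest
    import torch

    def fp8_e4m3_supported() -> bool:
        # Assumption: Triton's fp8e4nv needs compute capability >= 8.9
        # (Ada/Hopper); the sm_86 A10G here rejects it at compile time.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; FBGEMM may gate these tests differently.
    requires_fp8 = pytest.mark.skipif(
        not fp8_e4m3_supported(),
        reason="Triton fp8e4nv (FP8 E4M3) requires SM 8.9+",
    )

Applied as @requires_fp8 on test_silu_mul_quant, the examples above would be skipped instead of erroring inside Triton compilation.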
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.61 GiB is allocated by PyTorch and 141.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free; 21.50 GiB is allocated by PyTorch and 141.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB is allocated by PyTorch and 85.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB is allocated by PyTorch and 85.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:94: OutOfMemoryError
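Beyond the FP8 compile failures, the process is also exhausting GPU memory: each OOM above shows roughly 22 GiB already held on a 22.07 GiB device, so even small allocations (56-448 MiB) fail, and the allocator's own message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. A sketch of both mitigations follows; flushing the cache between Hypothesis examples assumes cross-example accumulation is the cause, which the log does not confirm.

    import os
    # Must be set before the first CUDA allocation to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc
    import torch

    def release_cached_memory() -> None:
        # Drop dead Python references, then return the caching allocator's
        # unused blocks to the driver (e.g. between Hypothesis examples).
        gc.collect()
        torch.cuda.empty_cache()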
[The same CompilationError traceback then repeats for the following examples; for the last one, the excerpt ends partway through the traceback:]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.5528799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.5541337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.5542137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.5542692Z kernel = self.compile( 2025-05-07T20:32:15.5543250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.5543930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.5544333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5544574Z 2025-05-07T20:32:15.5544788Z self = 2025-05-07T20:32:15.5545908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.5547338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7adcae0>} 2025-05-07T20:32:15.5548729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.5549913Z context = 2025-05-07T20:32:15.5550215Z 2025-05-07T20:32:15.5550388Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.5550938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.5551419Z module_map=module_map) 2025-05-07T20:32:15.5551801Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.5552163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.5552425Z E ^ 2025-05-07T20:32:15.5553036Z E ValueError("type fp8e4nv not supported in this architecture. 
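For context on what keeps failing to compile: silu_mul_quant fuses a SiLU-gated multiply with quantization to fp8, returning the quantized tensor plus a scale. An eager-mode sketch of the unquantized part, with semantics inferred from the test body rather than taken from the Triton kernel:

```python
import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU-gated multiply; the fused kernel additionally quantizes this
    # result to fp8 and returns a scale tensor alongside it.
    return F.silu(x0) * x1
```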
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.5553504Z 2025-05-07T20:32:15.5553936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.5554465Z 2025-05-07T20:32:15.5554577Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5555013Z self=, 2025-05-07T20:32:15.5555428Z T=2048, 2025-05-07T20:32:15.5555622Z D=7168, 2025-05-07T20:32:15.5555820Z scale_ub=1200.0, 2025-05-07T20:32:15.5556045Z contiguous=True, 2025-05-07T20:32:15.5556273Z compiled=False, 2025-05-07T20:32:15.5556487Z ) 2025-05-07T20:32:15.6356228Z self = 2025-05-07T20:32:15.6357154Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.6357637Z 2025-05-07T20:32:15.6357758Z @given( 2025-05-07T20:32:15.6358128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6358630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6359122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6359661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6360198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6360677Z ) 2025-05-07T20:32:15.6361580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6362328Z def test_silu_mul_quant( 2025-05-07T20:32:15.6362720Z self, 2025-05-07T20:32:15.6363029Z T: int, 2025-05-07T20:32:15.6363336Z D: int, 2025-05-07T20:32:15.6363687Z scale_ub: Optional[float], 2025-05-07T20:32:15.6364245Z contiguous: bool, 2025-05-07T20:32:15.6364580Z compiled: bool, 2025-05-07T20:32:15.6364903Z ) -> None: 2025-05-07T20:32:15.6365222Z torch.manual_seed(2025) 2025-05-07T20:32:15.6365602Z 2025-05-07T20:32:15.6366037Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6369679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.6373439Z 2025-05-07T20:32:15.6373650Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.6374038Z 2025-05-07T20:32:15.6374206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6374937Z self=, 2025-05-07T20:32:15.6375630Z T=1, 2025-05-07T20:32:15.6375925Z D=5120, 2025-05-07T20:32:15.6376224Z scale_ub=1200.0, 2025-05-07T20:32:15.6376553Z contiguous=True, 2025-05-07T20:32:15.6376925Z compiled=False, 2025-05-07T20:32:15.6377263Z ) 2025-05-07T20:32:15.6377789Z self = 2025-05-07T20:32:15.6378591Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.6379034Z 2025-05-07T20:32:15.6379158Z @given( 2025-05-07T20:32:15.6379522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6380010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6380512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6381058Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6381593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6382057Z ) 2025-05-07T20:32:15.6382885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6383642Z def test_silu_mul_quant( 2025-05-07T20:32:15.6384014Z self, 2025-05-07T20:32:15.6384323Z T: int, 2025-05-07T20:32:15.6384639Z D: int, 2025-05-07T20:32:15.6384987Z scale_ub: Optional[float], 2025-05-07T20:32:15.6385450Z contiguous: bool, 2025-05-07T20:32:15.6385832Z compiled: bool, 2025-05-07T20:32:15.6386183Z ) -> None: 2025-05-07T20:32:15.6386507Z torch.manual_seed(2025) 2025-05-07T20:32:15.6386932Z 2025-05-07T20:32:15.6387404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6387997Z 2025-05-07T20:32:15.6388308Z x_sign = torch.sign(x) 2025-05-07T20:32:15.6388788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.6389333Z x = x_sign * x_clamp 2025-05-07T20:32:15.6389730Z x0 = x[:, :D] 2025-05-07T20:32:15.6390073Z x1 = x[:, D:] 2025-05-07T20:32:15.6390416Z 2025-05-07T20:32:15.6390723Z if contiguous: 2025-05-07T20:32:15.6391099Z x0 = x0.contiguous() 2025-05-07T20:32:15.6391532Z x1 = x1.contiguous() 2025-05-07T20:32:15.6391935Z 2025-05-07T20:32:15.6392245Z if scale_ub is not None: 2025-05-07T20:32:15.6392696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.6393273Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.6393893Z ) 2025-05-07T20:32:15.6394198Z else: 2025-05-07T20:32:15.6394538Z scale_ub_tensor = None 2025-05-07T20:32:15.6394967Z 2025-05-07T20:32:15.6395340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.6395884Z op = silu_mul_quant 2025-05-07T20:32:15.6396371Z if compiled: 2025-05-07T20:32:15.6396778Z op = torch.compile(op) 2025-05-07T20:32:15.6397280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.6397752Z 2025-05-07T20:32:15.6398055Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.6398332Z 2025-05-07T20:32:15.6398486Z moe/activation_test.py:117: 2025-05-07T20:32:15.6398963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6399506Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.6399958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.6401200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.6402499Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.6403467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.6404733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.6405964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.6407144Z kernel = self.compile( 2025-05-07T20:32:15.6408128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.6409352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.6410060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6410476Z 2025-05-07T20:32:15.6410847Z self = 2025-05-07T20:32:15.6412991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.6415646Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7ade0c0>} 2025-05-07T20:32:15.6418474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.6420274Z context = 2025-05-07T20:32:15.6420695Z 2025-05-07T20:32:15.6420943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.6421748Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.6422480Z module_map=module_map) 2025-05-07T20:32:15.6423006Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.6423501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.6423869Z E ^ 2025-05-07T20:32:15.6424549Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.6425219Z 2025-05-07T20:32:15.6425851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.6426627Z 2025-05-07T20:32:15.6426769Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6427421Z self=, 2025-05-07T20:32:15.6428011Z T=2048, 2025-05-07T20:32:15.6428261Z D=5120, 2025-05-07T20:32:15.6428656Z scale_ub=None, 2025-05-07T20:32:15.6428957Z contiguous=True, 2025-05-07T20:32:15.6429266Z compiled=False, 2025-05-07T20:32:15.6429559Z ) 2025-05-07T20:32:15.6430016Z self = 2025-05-07T20:32:15.6430723Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.6431226Z 2025-05-07T20:32:15.6431339Z @given( 2025-05-07T20:32:15.6431653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6432098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6432528Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6432996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6433460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6433870Z ) 2025-05-07T20:32:15.6434372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6435017Z def test_silu_mul_quant( 2025-05-07T20:32:15.6435346Z self, 2025-05-07T20:32:15.6435629Z T: int, 2025-05-07T20:32:15.6435903Z D: int, 2025-05-07T20:32:15.6436195Z scale_ub: Optional[float], 2025-05-07T20:32:15.6436573Z contiguous: bool, 2025-05-07T20:32:15.6436903Z compiled: bool, 2025-05-07T20:32:15.6437207Z ) -> None: 2025-05-07T20:32:15.6437507Z torch.manual_seed(2025) 2025-05-07T20:32:15.6437851Z 2025-05-07T20:32:15.6438219Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6438702Z 2025-05-07T20:32:15.6438973Z > x_sign = torch.sign(x) 2025-05-07T20:32:15.6441912Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.6444739Z 2025-05-07T20:32:15.6444926Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:15.6445269Z 2025-05-07T20:32:15.6445420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6446076Z self=, 2025-05-07T20:32:15.6446740Z T=16384, 2025-05-07T20:32:15.6447217Z D=5120, 2025-05-07T20:32:15.6447535Z scale_ub=None, 2025-05-07T20:32:15.6447883Z contiguous=True, 2025-05-07T20:32:15.6448229Z compiled=False, 2025-05-07T20:32:15.6448544Z ) 2025-05-07T20:32:15.7182797Z self = 2025-05-07T20:32:15.7183703Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.7184211Z 2025-05-07T20:32:15.7184327Z @given( 2025-05-07T20:32:15.7184683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7185186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7185678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7186217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7186754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7187283Z ) 2025-05-07T20:32:15.7187877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7188640Z def test_silu_mul_quant( 2025-05-07T20:32:15.7189025Z self, 2025-05-07T20:32:15.7189332Z T: int, 2025-05-07T20:32:15.7189669Z D: int, 2025-05-07T20:32:15.7190010Z scale_ub: Optional[float], 2025-05-07T20:32:15.7190450Z contiguous: bool, 2025-05-07T20:32:15.7190827Z compiled: bool, 2025-05-07T20:32:15.7191182Z ) -> None: 2025-05-07T20:32:15.7191731Z torch.manual_seed(2025) 2025-05-07T20:32:15.7192117Z 2025-05-07T20:32:15.7192555Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7196237Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7200050Z 2025-05-07T20:32:15.7200255Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7200639Z 2025-05-07T20:32:15.7200807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7201528Z self=, 2025-05-07T20:32:15.7202205Z T=4096, 2025-05-07T20:32:15.7202500Z D=5120, 2025-05-07T20:32:15.7202801Z scale_ub=None, 2025-05-07T20:32:15.7203140Z contiguous=True, 2025-05-07T20:32:15.7203499Z compiled=False, 2025-05-07T20:32:15.7203836Z ) 2025-05-07T20:32:15.7204361Z self = 2025-05-07T20:32:15.7205182Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.7205647Z 2025-05-07T20:32:15.7205783Z @given( 2025-05-07T20:32:15.7206138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7206916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7207445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7207979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7208523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7209002Z ) 2025-05-07T20:32:15.7209576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7210329Z def test_silu_mul_quant( 2025-05-07T20:32:15.7210706Z self, 2025-05-07T20:32:15.7211012Z T: int, 2025-05-07T20:32:15.7211325Z D: int, 2025-05-07T20:32:15.7211686Z scale_ub: Optional[float], 2025-05-07T20:32:15.7212261Z contiguous: bool, 2025-05-07T20:32:15.7212660Z compiled: bool, 2025-05-07T20:32:15.7213005Z ) -> None: 2025-05-07T20:32:15.7213344Z torch.manual_seed(2025) 2025-05-07T20:32:15.7214001Z 2025-05-07T20:32:15.7214467Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7218401Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
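The OOM messages themselves diagnose fragmentation: tens of MiB sit reserved by PyTorch but unallocated while the free pool is down to about 26 MiB. The allocator hint quoted in each message has to be applied before the process touches CUDA, for example in the environment that launches pytest (a sketch, assuming no CUDA context exists yet):

```python
import os

# Must be set before the first CUDA allocation in the process;
# setting it after torch has initialized CUDA has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```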
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7222060Z 2025-05-07T20:32:15.7222258Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7222640Z 2025-05-07T20:32:15.7222806Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7223541Z self=, 2025-05-07T20:32:15.7224253Z T=2048, 2025-05-07T20:32:15.7224559Z D=5120, 2025-05-07T20:32:15.7224863Z scale_ub=None, 2025-05-07T20:32:15.7225192Z contiguous=False, 2025-05-07T20:32:15.7225548Z compiled=False, 2025-05-07T20:32:15.7225867Z ) 2025-05-07T20:32:15.7226381Z self = 2025-05-07T20:32:15.7227308Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.7227776Z 2025-05-07T20:32:15.7227907Z @given( 2025-05-07T20:32:15.7228283Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7228817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7229340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7230014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7230574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7231067Z ) 2025-05-07T20:32:15.7231686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7232462Z def test_silu_mul_quant( 2025-05-07T20:32:15.7232870Z self, 2025-05-07T20:32:15.7233182Z T: int, 2025-05-07T20:32:15.7233493Z D: int, 2025-05-07T20:32:15.7233846Z scale_ub: Optional[float], 2025-05-07T20:32:15.7234304Z contiguous: bool, 2025-05-07T20:32:15.7234702Z compiled: bool, 2025-05-07T20:32:15.7235077Z ) -> None: 2025-05-07T20:32:15.7235426Z torch.manual_seed(2025) 2025-05-07T20:32:15.7235839Z 2025-05-07T20:32:15.7236285Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7240345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7244023Z 2025-05-07T20:32:15.7244221Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7244597Z 2025-05-07T20:32:15.7244775Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7245492Z self=, 2025-05-07T20:32:15.7246210Z T=4096, 2025-05-07T20:32:15.7246490Z D=7168, 2025-05-07T20:32:15.7246767Z scale_ub=None, 2025-05-07T20:32:15.7247091Z contiguous=True, 2025-05-07T20:32:15.7247450Z compiled=True, 2025-05-07T20:32:15.7247750Z ) 2025-05-07T20:32:15.7248212Z self = 2025-05-07T20:32:15.7249113Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.7249522Z 2025-05-07T20:32:15.7249647Z @given( 2025-05-07T20:32:15.7249957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7250402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7250830Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7251287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7251873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7252288Z ) 2025-05-07T20:32:15.7252785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7253415Z def test_silu_mul_quant( 2025-05-07T20:32:15.7253759Z self, 2025-05-07T20:32:15.7254030Z T: int, 2025-05-07T20:32:15.7254299Z D: int, 2025-05-07T20:32:15.7254599Z scale_ub: Optional[float], 2025-05-07T20:32:15.7254982Z contiguous: bool, 2025-05-07T20:32:15.7255306Z compiled: bool, 2025-05-07T20:32:15.7255637Z ) -> None: 2025-05-07T20:32:15.7255938Z torch.manual_seed(2025) 2025-05-07T20:32:15.7256269Z 2025-05-07T20:32:15.7256645Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7259776Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7262738Z 2025-05-07T20:32:15.7262908Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7263211Z 2025-05-07T20:32:15.7263359Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7263948Z self=, 2025-05-07T20:32:15.7264523Z T=2048, 2025-05-07T20:32:15.7264789Z D=5120, 2025-05-07T20:32:15.7265048Z scale_ub=1200.0, 2025-05-07T20:32:15.7265356Z contiguous=False, 2025-05-07T20:32:15.7265665Z compiled=False, 2025-05-07T20:32:15.7265938Z ) 2025-05-07T20:32:15.7266383Z self = 2025-05-07T20:32:15.7267108Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:15.7267505Z 2025-05-07T20:32:15.7267617Z @given( 2025-05-07T20:32:15.7267922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7268358Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7268799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7269280Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7269762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7270183Z ) 2025-05-07T20:32:15.7270721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7271427Z def test_silu_mul_quant( 2025-05-07T20:32:15.7271816Z self, 2025-05-07T20:32:15.7272105Z T: int, 2025-05-07T20:32:15.7272421Z D: int, 2025-05-07T20:32:15.7272767Z scale_ub: Optional[float], 2025-05-07T20:32:15.7273219Z contiguous: bool, 2025-05-07T20:32:15.7273596Z compiled: bool, 2025-05-07T20:32:15.7273942Z ) -> None: 2025-05-07T20:32:15.7274277Z torch.manual_seed(2025) 2025-05-07T20:32:15.7274657Z 2025-05-07T20:32:15.7275081Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7279605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7283012Z 2025-05-07T20:32:15.7283212Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7283571Z 2025-05-07T20:32:15.7283729Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7284411Z self=, 2025-05-07T20:32:15.7285094Z T=4096, 2025-05-07T20:32:15.7285387Z D=7168, 2025-05-07T20:32:15.7285670Z scale_ub=1200.0, 2025-05-07T20:32:15.7286010Z contiguous=True, 2025-05-07T20:32:15.7286358Z compiled=False, 2025-05-07T20:32:15.7286679Z ) 2025-05-07T20:32:15.8330049Z self = 2025-05-07T20:32:15.8331003Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.8331473Z 2025-05-07T20:32:15.8331589Z @given( 2025-05-07T20:32:15.8332035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8332534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8333028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8333867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8334388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8334839Z ) 2025-05-07T20:32:15.8335404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8336162Z def test_silu_mul_quant( 2025-05-07T20:32:15.8336666Z self, 2025-05-07T20:32:15.8336964Z T: int, 2025-05-07T20:32:15.8337267Z D: int, 2025-05-07T20:32:15.8337593Z scale_ub: Optional[float], 2025-05-07T20:32:15.8338017Z contiguous: bool, 2025-05-07T20:32:15.8338395Z compiled: bool, 2025-05-07T20:32:15.8338744Z ) -> None: 2025-05-07T20:32:15.8339076Z torch.manual_seed(2025) 2025-05-07T20:32:15.8339465Z 2025-05-07T20:32:15.8339889Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8343589Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8347036Z 2025-05-07T20:32:15.8347212Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8347570Z 2025-05-07T20:32:15.8347730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8348403Z self=, 2025-05-07T20:32:15.8349065Z T=16384, 2025-05-07T20:32:15.8349364Z D=7168, 2025-05-07T20:32:15.8349649Z scale_ub=None, 2025-05-07T20:32:15.8349933Z contiguous=False, 2025-05-07T20:32:15.8350239Z compiled=True, 2025-05-07T20:32:15.8350529Z ) 2025-05-07T20:32:15.8350992Z self = 2025-05-07T20:32:15.8351766Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.8352228Z 2025-05-07T20:32:15.8352360Z @given( 2025-05-07T20:32:15.8352692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8353182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8353680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8354456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8355023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8355508Z ) 2025-05-07T20:32:15.8356111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8356862Z def test_silu_mul_quant( 2025-05-07T20:32:15.8357274Z self, 2025-05-07T20:32:15.8357593Z T: int, 2025-05-07T20:32:15.8357912Z D: int, 2025-05-07T20:32:15.8370009Z scale_ub: Optional[float], 2025-05-07T20:32:15.8370512Z contiguous: bool, 2025-05-07T20:32:15.8370903Z compiled: bool, 2025-05-07T20:32:15.8371269Z ) -> None: 2025-05-07T20:32:15.8371621Z torch.manual_seed(2025) 2025-05-07T20:32:15.8372123Z 2025-05-07T20:32:15.8372585Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8376336Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
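The failed allocation sizes track the input tensor exactly: x has shape [T, 2 * D] in bfloat16, two bytes per element. For the T=16384, D=7168 draw above, that works out to precisely the 448.00 MiB the allocator reports:

```python
T, D = 16384, 7168
bytes_needed = T * (2 * D) * 2   # bfloat16 is 2 bytes per element
print(bytes_needed / 2**20)      # 448.0 (MiB)
```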
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8379707Z 2025-05-07T20:32:15.8379912Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8380278Z 2025-05-07T20:32:15.8380453Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8381150Z self=, 2025-05-07T20:32:15.8381845Z T=4096, 2025-05-07T20:32:15.8382239Z D=7168, 2025-05-07T20:32:15.8382541Z scale_ub=None, 2025-05-07T20:32:15.8382894Z contiguous=True, 2025-05-07T20:32:15.8383264Z compiled=False, 2025-05-07T20:32:15.8383586Z ) 2025-05-07T20:32:15.8384127Z self = 2025-05-07T20:32:15.8384980Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.8385453Z 2025-05-07T20:32:15.8385581Z @given( 2025-05-07T20:32:15.8385943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8386464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8386977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8387525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8388065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8388530Z ) 2025-05-07T20:32:15.8389112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8389881Z def test_silu_mul_quant( 2025-05-07T20:32:15.8390276Z self, 2025-05-07T20:32:15.8390585Z T: int, 2025-05-07T20:32:15.8390892Z D: int, 2025-05-07T20:32:15.8391248Z scale_ub: Optional[float], 2025-05-07T20:32:15.8391695Z contiguous: bool, 2025-05-07T20:32:15.8392079Z compiled: bool, 2025-05-07T20:32:15.8392437Z ) -> None: 2025-05-07T20:32:15.8392782Z torch.manual_seed(2025) 2025-05-07T20:32:15.8393169Z 2025-05-07T20:32:15.8393604Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8397390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8400576Z 2025-05-07T20:32:15.8400888Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8401228Z 2025-05-07T20:32:15.8401385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8402033Z self=, 2025-05-07T20:32:15.8402665Z T=16384, 2025-05-07T20:32:15.8402967Z D=7168, 2025-05-07T20:32:15.8403240Z scale_ub=None, 2025-05-07T20:32:15.8403571Z contiguous=True, 2025-05-07T20:32:15.8403919Z compiled=False, 2025-05-07T20:32:15.8404224Z ) 2025-05-07T20:32:15.8404735Z self = 2025-05-07T20:32:15.8405585Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.8406070Z 2025-05-07T20:32:15.8406712Z @given( 2025-05-07T20:32:15.8407100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8407668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8408200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8408753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8409309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8409793Z ) 2025-05-07T20:32:15.8410376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8411157Z def test_silu_mul_quant( 2025-05-07T20:32:15.8411557Z self, 2025-05-07T20:32:15.8412123Z T: int, 2025-05-07T20:32:15.8412454Z D: int, 2025-05-07T20:32:15.8412805Z scale_ub: Optional[float], 2025-05-07T20:32:15.8413251Z contiguous: bool, 2025-05-07T20:32:15.8413633Z compiled: bool, 2025-05-07T20:32:15.8413995Z ) -> None: 2025-05-07T20:32:15.8414338Z torch.manual_seed(2025) 2025-05-07T20:32:15.8414863Z 2025-05-07T20:32:15.8415307Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8419073Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8422516Z 2025-05-07T20:32:15.8422728Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8423090Z 2025-05-07T20:32:15.8423266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8423962Z self=, 2025-05-07T20:32:15.8424663Z T=16384, 2025-05-07T20:32:15.8424968Z D=7168, 2025-05-07T20:32:15.8425263Z scale_ub=1200.0, 2025-05-07T20:32:15.8425618Z contiguous=True, 2025-05-07T20:32:15.8425985Z compiled=False, 2025-05-07T20:32:15.8426300Z ) 2025-05-07T20:32:15.8426808Z self = 2025-05-07T20:32:15.8427697Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.8428176Z 2025-05-07T20:32:15.8428300Z @given( 2025-05-07T20:32:15.8428674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8429208Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8429721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8430286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8430846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8431332Z ) 2025-05-07T20:32:15.8431927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8432699Z def test_silu_mul_quant( 2025-05-07T20:32:15.8433096Z self, 2025-05-07T20:32:15.8433399Z T: int, 2025-05-07T20:32:15.8433925Z D: int, 2025-05-07T20:32:15.8434299Z scale_ub: Optional[float], 2025-05-07T20:32:15.8434735Z contiguous: bool, 2025-05-07T20:32:15.8435139Z compiled: bool, 2025-05-07T20:32:15.8435516Z ) -> None: 2025-05-07T20:32:15.8435861Z torch.manual_seed(2025) 2025-05-07T20:32:15.8436263Z 2025-05-07T20:32:15.8436707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8440543Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8443988Z 2025-05-07T20:32:15.8444200Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8444566Z 2025-05-07T20:32:15.8444733Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8445436Z self=, 2025-05-07T20:32:15.8446128Z T=128, 2025-05-07T20:32:15.8446519Z D=5120, 2025-05-07T20:32:15.8446826Z scale_ub=1200.0, 2025-05-07T20:32:15.8447189Z contiguous=False, 2025-05-07T20:32:15.8447552Z compiled=False, 2025-05-07T20:32:15.8447888Z ) 2025-05-07T20:32:15.9681061Z self = 2025-05-07T20:32:15.9682511Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:15.9683838Z 2025-05-07T20:32:15.9684008Z @given( 2025-05-07T20:32:15.9684467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9685098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9685694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9686349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9687002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9687330Z ) 2025-05-07T20:32:15.9687678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9688136Z def test_silu_mul_quant( 2025-05-07T20:32:15.9688384Z self, 2025-05-07T20:32:15.9688577Z T: int, 2025-05-07T20:32:15.9688778Z D: int, 2025-05-07T20:32:15.9688999Z scale_ub: Optional[float], 2025-05-07T20:32:15.9689268Z contiguous: bool, 2025-05-07T20:32:15.9689514Z compiled: bool, 2025-05-07T20:32:15.9689753Z ) -> None: 2025-05-07T20:32:15.9689970Z torch.manual_seed(2025) 2025-05-07T20:32:15.9690219Z 2025-05-07T20:32:15.9690497Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9690846Z 2025-05-07T20:32:15.9691044Z x_sign = torch.sign(x) 2025-05-07T20:32:15.9691344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.9691660Z x = x_sign * x_clamp 2025-05-07T20:32:15.9691995Z x0 = x[:, :D] 2025-05-07T20:32:15.9692213Z x1 = x[:, D:] 2025-05-07T20:32:15.9692421Z 2025-05-07T20:32:15.9692607Z if contiguous: 2025-05-07T20:32:15.9692842Z x0 = x0.contiguous() 2025-05-07T20:32:15.9693101Z x1 = x1.contiguous() 2025-05-07T20:32:15.9693340Z 2025-05-07T20:32:15.9693529Z if scale_ub is not None: 2025-05-07T20:32:15.9693803Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.9694143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.9694453Z ) 2025-05-07T20:32:15.9694648Z else: 2025-05-07T20:32:15.9694863Z scale_ub_tensor = None 2025-05-07T20:32:15.9695111Z 2025-05-07T20:32:15.9695516Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.9695840Z op = silu_mul_quant 2025-05-07T20:32:15.9696088Z if compiled: 2025-05-07T20:32:15.9696337Z op = torch.compile(op) 2025-05-07T20:32:15.9696637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9696908Z 2025-05-07T20:32:15.9697099Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.9697276Z 2025-05-07T20:32:15.9697376Z moe/activation_test.py:117: 2025-05-07T20:32:15.9697680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9698013Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.9698298Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9699013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.9699721Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.9700278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.9700983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.9701670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.9702215Z kernel = self.compile( 2025-05-07T20:32:15.9702775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.9703546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.9703948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9704193Z 2025-05-07T20:32:15.9704452Z self = 2025-05-07T20:32:15.9705576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.9707432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7910cc0>} 2025-05-07T20:32:15.9708822Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.9709879Z context = 2025-05-07T20:32:15.9710183Z 2025-05-07T20:32:15.9710355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.9710900Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.9711381Z module_map=module_map) 2025-05-07T20:32:15.9711753Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.9712116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.9712381Z E ^ 2025-05-07T20:32:15.9712850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9713320Z 2025-05-07T20:32:15.9713750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9714288Z 2025-05-07T20:32:15.9714395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.9714829Z self=, 2025-05-07T20:32:15.9715253Z T=2048, 2025-05-07T20:32:15.9715510Z D=7168, 2025-05-07T20:32:15.9715804Z scale_ub=None, 2025-05-07T20:32:15.9716138Z contiguous=False, 2025-05-07T20:32:15.9716461Z compiled=False, 2025-05-07T20:32:15.9716756Z ) 2025-05-07T20:32:15.9717392Z self = 2025-05-07T20:32:15.9718076Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.9718505Z 2025-05-07T20:32:15.9718627Z @given( 2025-05-07T20:32:15.9718988Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9719467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9719948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9720473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9721022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9721474Z ) 2025-05-07T20:32:15.9722032Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9722798Z def test_silu_mul_quant( 2025-05-07T20:32:15.9723158Z self, 2025-05-07T20:32:15.9723417Z T: int, 2025-05-07T20:32:15.9723683Z D: int, 2025-05-07T20:32:15.9723975Z scale_ub: Optional[float], 2025-05-07T20:32:15.9724303Z contiguous: bool, 2025-05-07T20:32:15.9724548Z compiled: bool, 2025-05-07T20:32:15.9724773Z ) -> None: 2025-05-07T20:32:15.9724984Z torch.manual_seed(2025) 2025-05-07T20:32:15.9725231Z 2025-05-07T20:32:15.9725509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9727641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.9729737Z 2025-05-07T20:32:15.9729872Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.9730089Z 2025-05-07T20:32:15.9730193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.9730619Z self=, 2025-05-07T20:32:15.9731033Z T=128, 2025-05-07T20:32:15.9731217Z D=7168, 2025-05-07T20:32:15.9731412Z scale_ub=1200.0, 2025-05-07T20:32:15.9731639Z contiguous=True, 2025-05-07T20:32:15.9731945Z compiled=True, 2025-05-07T20:32:15.9732149Z ) 2025-05-07T20:32:16.0035995Z self = 2025-05-07T20:32:16.0037159Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.0037530Z 2025-05-07T20:32:16.0037634Z @given( 2025-05-07T20:32:16.0037957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0038348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0038648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0038991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0039327Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0039611Z ) 2025-05-07T20:32:16.0039997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0040443Z def test_silu_mul_quant( 2025-05-07T20:32:16.0040692Z self, 2025-05-07T20:32:16.0040890Z T: int, 2025-05-07T20:32:16.0041089Z D: int, 2025-05-07T20:32:16.0041306Z scale_ub: Optional[float], 2025-05-07T20:32:16.0041580Z contiguous: bool, 2025-05-07T20:32:16.0041818Z compiled: bool, 2025-05-07T20:32:16.0042047Z ) -> None: 2025-05-07T20:32:16.0042265Z torch.manual_seed(2025) 2025-05-07T20:32:16.0042507Z 2025-05-07T20:32:16.0042787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0043138Z 2025-05-07T20:32:16.0043343Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0043883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0044205Z x = x_sign * x_clamp 2025-05-07T20:32:16.0044448Z x0 = x[:, :D] 2025-05-07T20:32:16.0044661Z x1 = x[:, D:] 2025-05-07T20:32:16.0044869Z 2025-05-07T20:32:16.0045056Z if contiguous: 2025-05-07T20:32:16.0045283Z x0 = x0.contiguous() 2025-05-07T20:32:16.0045544Z x1 = x1.contiguous() 2025-05-07T20:32:16.0045788Z 2025-05-07T20:32:16.0045978Z if scale_ub is not None: 2025-05-07T20:32:16.0046254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.0046597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.0046906Z ) 2025-05-07T20:32:16.0047098Z else: 2025-05-07T20:32:16.0047316Z scale_ub_tensor = None 2025-05-07T20:32:16.0047567Z 2025-05-07T20:32:16.0047805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0048124Z op = silu_mul_quant 2025-05-07T20:32:16.0048387Z if compiled: 2025-05-07T20:32:16.0048631Z op = torch.compile(op) 2025-05-07T20:32:16.0048931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0049209Z 2025-05-07T20:32:16.0049396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.0049567Z 2025-05-07T20:32:16.0049667Z moe/activation_test.py:117: 2025-05-07T20:32:16.0049969Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0050386Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.0050671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0051250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.0051909Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.0052663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.0053376Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.0053935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.0054632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.0055319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.0055872Z kernel = self.compile( 2025-05-07T20:32:16.0056428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.0057108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.0057562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0057802Z 2025-05-07T20:32:16.0058023Z self = 2025-05-07T20:32:16.0059145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.0060572Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7911a80>} 2025-05-07T20:32:16.0061959Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.0063019Z context = 2025-05-07T20:32:16.0063314Z 2025-05-07T20:32:16.0063498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.0064028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.0064597Z module_map=module_map) 2025-05-07T20:32:16.0064977Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.0065339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.0065598Z E ^ 2025-05-07T20:32:16.0066075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.0066537Z 2025-05-07T20:32:16.0066973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.0067505Z 2025-05-07T20:32:16.0067608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0068034Z self=, 2025-05-07T20:32:16.0068453Z T=128, 2025-05-07T20:32:16.0068644Z D=7168, 2025-05-07T20:32:16.0068832Z scale_ub=1200.0, 2025-05-07T20:32:16.0069061Z contiguous=True, 2025-05-07T20:32:16.0069286Z compiled=False, 2025-05-07T20:32:16.0069489Z ) 2025-05-07T20:32:16.0069821Z self = 2025-05-07T20:32:16.0070329Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.0070606Z 2025-05-07T20:32:16.0070684Z @given( 2025-05-07T20:32:16.0070918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0071236Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0071628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0071965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0072298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0072588Z ) 2025-05-07T20:32:16.0072939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0073434Z def test_silu_mul_quant( 2025-05-07T20:32:16.0073678Z self, 2025-05-07T20:32:16.0073868Z T: int, 2025-05-07T20:32:16.0074067Z D: int, 2025-05-07T20:32:16.0074291Z scale_ub: Optional[float], 2025-05-07T20:32:16.0074564Z contiguous: bool, 2025-05-07T20:32:16.0074807Z compiled: bool, 2025-05-07T20:32:16.0075036Z ) -> None: 2025-05-07T20:32:16.0075247Z torch.manual_seed(2025) 2025-05-07T20:32:16.0075491Z 2025-05-07T20:32:16.0075766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0076114Z 2025-05-07T20:32:16.0076311Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0076603Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0078746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.0080681Z 2025-05-07T20:32:16.0080807Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.0081022Z 2025-05-07T20:32:16.0081124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0081548Z self=, 2025-05-07T20:32:16.0081966Z T=128, 2025-05-07T20:32:16.0082150Z D=5120, 2025-05-07T20:32:16.0082346Z scale_ub=1200.0, 2025-05-07T20:32:16.0082573Z contiguous=True, 2025-05-07T20:32:16.0082793Z compiled=True, 2025-05-07T20:32:16.0083000Z ) 2025-05-07T20:32:16.0083326Z self = 2025-05-07T20:32:16.0083834Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.0084108Z 2025-05-07T20:32:16.0084185Z @given( 2025-05-07T20:32:16.0084509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0084828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0085131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0085464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0085797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0086080Z ) 2025-05-07T20:32:16.0086436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0086891Z def test_silu_mul_quant( 2025-05-07T20:32:16.0087140Z self, 2025-05-07T20:32:16.0087355Z T: int, 2025-05-07T20:32:16.0087576Z D: int, 2025-05-07T20:32:16.0087795Z scale_ub: Optional[float], 2025-05-07T20:32:16.0088066Z contiguous: bool, 2025-05-07T20:32:16.0088309Z compiled: bool, 2025-05-07T20:32:16.0088534Z ) -> None: 2025-05-07T20:32:16.0088750Z torch.manual_seed(2025) 2025-05-07T20:32:16.0088992Z 2025-05-07T20:32:16.0089295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0089645Z 2025-05-07T20:32:16.0089835Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0090129Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0101385Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
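By this point even a 20 MiB allocation inside torch.clamp fails: Hypothesis replays the test body many times in one process, and tensors held by earlier failing examples accumulate. One mitigation is an explicit per-example cleanup at the top of the test body (a hypothetical helper, not present in the test file):

```python
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()               # drop dead Python references first
    torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
    torch.cuda.synchronize()
```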
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.0103498Z 2025-05-07T20:32:16.0103623Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.0103854Z 2025-05-07T20:32:16.0103969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0104399Z self=, 2025-05-07T20:32:16.0104815Z T=128, 2025-05-07T20:32:16.0105005Z D=7168, 2025-05-07T20:32:16.0105203Z scale_ub=None, 2025-05-07T20:32:16.0105422Z contiguous=True, 2025-05-07T20:32:16.0105643Z compiled=True, 2025-05-07T20:32:16.0105855Z ) 2025-05-07T20:32:16.4968068Z self = 2025-05-07T20:32:16.4968788Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.4969070Z 2025-05-07T20:32:16.4969152Z @given( 2025-05-07T20:32:16.4969399Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4969736Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4970047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4970388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4970730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4971023Z ) 2025-05-07T20:32:16.4971383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4971935Z def test_silu_mul_quant( 2025-05-07T20:32:16.4972183Z self, 2025-05-07T20:32:16.4972388Z T: int, 2025-05-07T20:32:16.4972595Z D: int, 2025-05-07T20:32:16.4972821Z scale_ub: Optional[float], 2025-05-07T20:32:16.4973103Z contiguous: bool, 2025-05-07T20:32:16.4973352Z compiled: bool, 2025-05-07T20:32:16.4973583Z ) -> None: 2025-05-07T20:32:16.4973808Z torch.manual_seed(2025) 2025-05-07T20:32:16.4974059Z 2025-05-07T20:32:16.4974337Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4976809Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.4978833Z 2025-05-07T20:32:16.4978956Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.4979182Z 2025-05-07T20:32:16.5019856Z FAILED 2025-05-07T20:32:16.5020033Z 2025-05-07T20:32:16.5020514Z =================================== FAILURES =================================== 2025-05-07T20:32:16.5021152Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:16.5021779Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:16.5022654Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:16.5023425Z | yield 2025-05-07T20:32:16.5024022Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:16.5024748Z | self._callTestMethod(testMethod) 2025-05-07T20:32:16.5025530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:16.5026603Z | if method() is not None: 2025-05-07T20:32:16.5026953Z | ^^^^^^^^ 2025-05-07T20:32:16.5027856Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:16.5028883Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5029397Z | ^^^^^^^ 2025-05-07T20:32:16.5030190Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:16.5031101Z | raise the_error_hypothesis_found 2025-05-07T20:32:16.5031698Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:16.5032282Z +-+---------------- 1 ---------------- 2025-05-07T20:32:16.5032687Z | Traceback (most recent call last): 2025-05-07T20:32:16.5033692Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.5034802Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5035320Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5038167Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5040990Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5041607Z | self=, 2025-05-07T20:32:16.5042176Z | T=2048, 2025-05-07T20:32:16.5042499Z | D=5120, # or any other generated value 2025-05-07T20:32:16.5042975Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:16.5043485Z | contiguous=True, # or any other generated value 2025-05-07T20:32:16.5043994Z | compiled=False, # or any other generated value 2025-05-07T20:32:16.5044423Z | ) 2025-05-07T20:32:16.5044672Z | 2025-05-07T20:32:16.5045557Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:16.5046445Z +---------------- 2 ---------------- 2025-05-07T20:32:16.5046849Z | Traceback (most recent call last): 2025-05-07T20:32:16.5047844Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.5048929Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5049451Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5052259Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5055144Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5055755Z | self=, 2025-05-07T20:32:16.5057057Z | T=128, 2025-05-07T20:32:16.5057339Z | D=7168, 2025-05-07T20:32:16.5057634Z | scale_ub=None, 2025-05-07T20:32:16.5057966Z | contiguous=True, 2025-05-07T20:32:16.5058241Z | compiled=True, 2025-05-07T20:32:16.5058481Z | ) 2025-05-07T20:32:16.5058667Z | 2025-05-07T20:32:16.5059214Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:16.5059910Z +---------------- 3 ---------------- 2025-05-07T20:32:16.5060208Z | Traceback (most recent call last): 2025-05-07T20:32:16.5060947Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.5061755Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5062144Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5064516Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5066582Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5067037Z | self=, 2025-05-07T20:32:16.5067462Z | T=128, 2025-05-07T20:32:16.5067668Z | D=5120, 2025-05-07T20:32:16.5067882Z | scale_ub=1200.0, 2025-05-07T20:32:16.5068132Z | contiguous=True, 2025-05-07T20:32:16.5068384Z | compiled=True, 2025-05-07T20:32:16.5068617Z | ) 2025-05-07T20:32:16.5068802Z | 2025-05-07T20:32:16.5069343Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:16.5069964Z +---------------- 4 ---------------- 2025-05-07T20:32:16.5070272Z | Traceback (most recent call last): 2025-05-07T20:32:16.5071014Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:16.5071851Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5072145Z | ^^^^^^^^ 2025-05-07T20:32:16.5072807Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:16.5073645Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5074118Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5075242Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:16.5076363Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5077230Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:16.5078279Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5078906Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5079816Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:16.5080928Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5081648Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5082582Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:16.5083582Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5084154Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5085004Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:16.5085819Z | fn() 2025-05-07T20:32:16.5086631Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:16.5087543Z | self.fn.run( 2025-05-07T20:32:16.5088303Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:16.5089138Z | kernel = self.compile( 2025-05-07T20:32:16.5089500Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:16.5090349Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:16.5091359Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5092018Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5092938Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:16.5094075Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5094758Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5095285Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5095780Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5096151Z | ^ 2025-05-07T20:32:16.5096800Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5097618Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5098178Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:16.5098910Z | self=, 2025-05-07T20:32:16.5099686Z | T=1, # or any other generated value 2025-05-07T20:32:16.5100110Z | D=5120, # or any other generated value 2025-05-07T20:32:16.5100576Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:16.5101074Z | contiguous=True, # or any other generated value 2025-05-07T20:32:16.5101570Z | compiled=True, # or any other generated value 2025-05-07T20:32:16.5101987Z | ) 2025-05-07T20:32:16.5102238Z | 2025-05-07T20:32:16.5102964Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:16.5103820Z +------------------------------------ 2025-05-07T20:32:16.5104322Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:16.5104859Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5105428Z self=, 2025-05-07T20:32:16.5121799Z T=1, 2025-05-07T20:32:16.5122090Z D=5120, 2025-05-07T20:32:16.5122353Z scale_ub=None, 2025-05-07T20:32:16.5122656Z contiguous=True, 2025-05-07T20:32:16.5122967Z compiled=True, 2025-05-07T20:32:16.5123247Z ) 2025-05-07T20:32:16.5123700Z self = 2025-05-07T20:32:16.5124398Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5124956Z 2025-05-07T20:32:16.5125064Z @given( 2025-05-07T20:32:16.5125377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5125803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5126225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5126662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5127191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5127572Z ) 2025-05-07T20:32:16.5128028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5128643Z def test_silu_mul_quant( 2025-05-07T20:32:16.5128992Z self, 2025-05-07T20:32:16.5129241Z T: int, 2025-05-07T20:32:16.5129502Z D: int, 2025-05-07T20:32:16.5129796Z scale_ub: Optional[float], 2025-05-07T20:32:16.5130172Z contiguous: bool, 2025-05-07T20:32:16.5130507Z compiled: bool, 2025-05-07T20:32:16.5130807Z ) -> None: 2025-05-07T20:32:16.5131097Z torch.manual_seed(2025) 2025-05-07T20:32:16.5131435Z 2025-05-07T20:32:16.5131921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5132394Z 2025-05-07T20:32:16.5132644Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5133014Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5133449Z x = x_sign * x_clamp 2025-05-07T20:32:16.5133782Z x0 = x[:, :D] 2025-05-07T20:32:16.5134082Z x1 = x[:, D:] 2025-05-07T20:32:16.5134353Z 2025-05-07T20:32:16.5134596Z if contiguous: 2025-05-07T20:32:16.5134916Z x0 = x0.contiguous() 2025-05-07T20:32:16.5135262Z x1 = x1.contiguous() 2025-05-07T20:32:16.5135572Z 2025-05-07T20:32:16.5135828Z if scale_ub is not None: 2025-05-07T20:32:16.5136338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5136792Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5137263Z ) 2025-05-07T20:32:16.5137515Z else: 2025-05-07T20:32:16.5137784Z scale_ub_tensor = None 2025-05-07T20:32:16.5138117Z 2025-05-07T20:32:16.5138424Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5138830Z op = silu_mul_quant 2025-05-07T20:32:16.5139155Z if compiled: 2025-05-07T20:32:16.5139491Z op = torch.compile(op) 2025-05-07T20:32:16.5139876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5140224Z 2025-05-07T20:32:16.5140469Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5140998Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5141376Z 2025-05-07T20:32:16.5141681Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5142113Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5142484Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5142889Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5143353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5143754Z 2025-05-07T20:32:16.5144017Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5144281Z 2025-05-07T20:32:16.5144415Z moe/activation_test.py:126: 2025-05-07T20:32:16.5144801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5145236Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5145660Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5146740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5147767Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5148497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5149418Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5150412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5151388Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5152368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5153287Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5154136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5154858Z fn() 2025-05-07T20:32:16.5155576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5156400Z self.fn.run( 2025-05-07T20:32:16.5157049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5157797Z kernel = self.compile( 2025-05-07T20:32:16.5158552Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5159467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5159995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5160312Z 2025-05-07T20:32:16.5160578Z self = 2025-05-07T20:32:16.5162039Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5163931Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4960c20>} 2025-05-07T20:32:16.5165734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5167115Z context = 2025-05-07T20:32:16.5167504Z 2025-05-07T20:32:16.5167719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5168424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5169145Z module_map=module_map) 2025-05-07T20:32:16.5169631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5170127Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5170494Z E ^ 2025-05-07T20:32:16.5171154Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5171936Z 2025-05-07T20:32:16.5172541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5173292Z 2025-05-07T20:32:16.5173444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5174010Z self=, 2025-05-07T20:32:16.5174562Z T=2048, 2025-05-07T20:32:16.5174807Z D=5120, 2025-05-07T20:32:16.5175059Z scale_ub=1200.0, 2025-05-07T20:32:16.5175350Z contiguous=True, 2025-05-07T20:32:16.5175648Z compiled=False, 2025-05-07T20:32:16.5175920Z ) 2025-05-07T20:32:16.5176356Z self = 2025-05-07T20:32:16.5177029Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.5177406Z 2025-05-07T20:32:16.5177513Z @given( 2025-05-07T20:32:16.5177814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5178313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5178730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5179180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5179631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5180024Z ) 2025-05-07T20:32:16.5180546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5181170Z def test_silu_mul_quant( 2025-05-07T20:32:16.5181494Z self, 2025-05-07T20:32:16.5181747Z T: int, 2025-05-07T20:32:16.5182015Z D: int, 2025-05-07T20:32:16.5182297Z scale_ub: Optional[float], 2025-05-07T20:32:16.5182656Z contiguous: bool, 2025-05-07T20:32:16.5182962Z compiled: bool, 2025-05-07T20:32:16.5183265Z ) -> None: 2025-05-07T20:32:16.5183547Z torch.manual_seed(2025) 2025-05-07T20:32:16.5183865Z 2025-05-07T20:32:16.5184218Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5184674Z 2025-05-07T20:32:16.5184918Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5185298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5185705Z x = x_sign * x_clamp 2025-05-07T20:32:16.5186012Z x0 = x[:, :D] 
2025-05-07T20:32:16.5186293Z x1 = x[:, D:] 2025-05-07T20:32:16.5186573Z 2025-05-07T20:32:16.5186812Z if contiguous: 2025-05-07T20:32:16.5187109Z x0 = x0.contiguous() 2025-05-07T20:32:16.5187437Z x1 = x1.contiguous() 2025-05-07T20:32:16.5187742Z 2025-05-07T20:32:16.5187982Z if scale_ub is not None: 2025-05-07T20:32:16.5188339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5188768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5189165Z ) 2025-05-07T20:32:16.5189426Z else: 2025-05-07T20:32:16.5189700Z scale_ub_tensor = None 2025-05-07T20:32:16.5190023Z 2025-05-07T20:32:16.5190342Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5190777Z op = silu_mul_quant 2025-05-07T20:32:16.5191116Z if compiled: 2025-05-07T20:32:16.5191457Z op = torch.compile(op) 2025-05-07T20:32:16.5191869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5192248Z 2025-05-07T20:32:16.5192512Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5192740Z 2025-05-07T20:32:16.5192883Z moe/activation_test.py:117: 2025-05-07T20:32:16.5193390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5193855Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5194249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5195234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5196223Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5196998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5198026Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5198934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5199643Z kernel = self.compile( 2025-05-07T20:32:16.5200367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5201251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5222853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5223205Z 2025-05-07T20:32:16.5223488Z self = 2025-05-07T20:32:16.5225016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5227301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4820180>} 2025-05-07T20:32:16.5229335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5230812Z context = 2025-05-07T20:32:16.5231222Z 2025-05-07T20:32:16.5231453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5232195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5232845Z module_map=module_map) 2025-05-07T20:32:16.5233326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5233790Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5234133Z E ^ 2025-05-07T20:32:16.5234744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5235381Z 2025-05-07T20:32:16.5235963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5236677Z 2025-05-07T20:32:16.5236824Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5237389Z self=, 2025-05-07T20:32:16.5237945Z T=2048, 2025-05-07T20:32:16.5238196Z D=5120, 2025-05-07T20:32:16.5238451Z scale_ub=1200.0, 2025-05-07T20:32:16.5238744Z contiguous=True, 2025-05-07T20:32:16.5239047Z compiled=True, 2025-05-07T20:32:16.5239324Z ) 2025-05-07T20:32:16.5239753Z self = 2025-05-07T20:32:16.5240431Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.5240808Z 2025-05-07T20:32:16.5240915Z @given( 2025-05-07T20:32:16.5241223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5241642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5242053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5242498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5243151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5243562Z ) 2025-05-07T20:32:16.5244059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5244684Z def test_silu_mul_quant( 2025-05-07T20:32:16.5245016Z self, 2025-05-07T20:32:16.5245274Z T: int, 2025-05-07T20:32:16.5245552Z D: int, 2025-05-07T20:32:16.5245812Z scale_ub: Optional[float], 2025-05-07T20:32:16.5246092Z contiguous: bool, 2025-05-07T20:32:16.5246346Z compiled: bool, 2025-05-07T20:32:16.5246572Z ) -> None: 2025-05-07T20:32:16.5246799Z torch.manual_seed(2025) 2025-05-07T20:32:16.5247046Z 2025-05-07T20:32:16.5247319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5247668Z 2025-05-07T20:32:16.5247871Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5248163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5248478Z x = x_sign * x_clamp 2025-05-07T20:32:16.5248730Z x0 = x[:, :D] 2025-05-07T20:32:16.5248947Z x1 = x[:, D:] 2025-05-07T20:32:16.5249161Z 2025-05-07T20:32:16.5249350Z if contiguous: 2025-05-07T20:32:16.5249585Z x0 = x0.contiguous() 2025-05-07T20:32:16.5249843Z x1 = x1.contiguous() 2025-05-07T20:32:16.5250087Z 2025-05-07T20:32:16.5250279Z if scale_ub is not None: 2025-05-07T20:32:16.5250621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5250966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5251288Z ) 2025-05-07T20:32:16.5251481Z else: 2025-05-07T20:32:16.5251695Z scale_ub_tensor = None 2025-05-07T20:32:16.5252094Z 2025-05-07T20:32:16.5252385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5252710Z op = silu_mul_quant 2025-05-07T20:32:16.5252964Z if compiled: 2025-05-07T20:32:16.5253210Z op = torch.compile(op) 2025-05-07T20:32:16.5253514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5253797Z 2025-05-07T20:32:16.5253988Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5254280Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5254574Z 2025-05-07T20:32:16.5254816Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5255158Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5255461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5255786Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5256148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5256464Z 2025-05-07T20:32:16.5256673Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5256870Z 2025-05-07T20:32:16.5256975Z moe/activation_test.py:126: 2025-05-07T20:32:16.5257284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5257633Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5257970Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5258780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5259562Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5260124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5260823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5261534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5262282Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5263118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5263772Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5264387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5264916Z fn() 2025-05-07T20:32:16.5265435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5266029Z self.fn.run( 2025-05-07T20:32:16.5266506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5267052Z kernel = self.compile( 2025-05-07T20:32:16.5267600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5268276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5268692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5268926Z 2025-05-07T20:32:16.5269143Z self = 2025-05-07T20:32:16.5270253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5271728Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d45eaa20>} 2025-05-07T20:32:16.5273117Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5274218Z context = 2025-05-07T20:32:16.5274514Z 2025-05-07T20:32:16.5274687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5275225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5275702Z module_map=module_map) 2025-05-07T20:32:16.5276073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5276434Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5276713Z E ^ 2025-05-07T20:32:16.5277193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5277657Z 2025-05-07T20:32:16.5278090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5278623Z 2025-05-07T20:32:16.5278728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5279152Z self=, 2025-05-07T20:32:16.5279574Z T=16384, 2025-05-07T20:32:16.5279762Z D=7168, 2025-05-07T20:32:16.5279956Z scale_ub=1200.0, 2025-05-07T20:32:16.5280183Z contiguous=False, 2025-05-07T20:32:16.5280408Z compiled=False, 2025-05-07T20:32:16.5280612Z ) 2025-05-07T20:32:16.5280936Z self = 2025-05-07T20:32:16.5281446Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.5281745Z 2025-05-07T20:32:16.5281824Z @given( 2025-05-07T20:32:16.5282061Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5282382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5282689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5283029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5283364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5283650Z ) 2025-05-07T20:32:16.5284095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5284552Z def test_silu_mul_quant( 2025-05-07T20:32:16.5284789Z self, 2025-05-07T20:32:16.5284987Z T: int, 2025-05-07T20:32:16.5285188Z D: int, 2025-05-07T20:32:16.5285404Z scale_ub: Optional[float], 2025-05-07T20:32:16.5285680Z contiguous: bool, 2025-05-07T20:32:16.5285923Z compiled: bool, 2025-05-07T20:32:16.5286153Z ) -> None: 2025-05-07T20:32:16.5286366Z torch.manual_seed(2025) 2025-05-07T20:32:16.5286606Z 2025-05-07T20:32:16.5286883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5287222Z 2025-05-07T20:32:16.5287416Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5287712Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5288018Z x = x_sign * x_clamp 2025-05-07T20:32:16.5288262Z x0 = x[:, :D] 2025-05-07T20:32:16.5288477Z x1 = x[:, D:] 2025-05-07T20:32:16.5288678Z 2025-05-07T20:32:16.5288869Z if contiguous: 2025-05-07T20:32:16.5289101Z x0 = x0.contiguous() 2025-05-07T20:32:16.5289356Z x1 = x1.contiguous() 2025-05-07T20:32:16.5289597Z 2025-05-07T20:32:16.5289788Z if scale_ub is not None: 2025-05-07T20:32:16.5290055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5290394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5290760Z ) 2025-05-07T20:32:16.5290955Z else: 2025-05-07T20:32:16.5291164Z scale_ub_tensor = None 2025-05-07T20:32:16.5291416Z 2025-05-07T20:32:16.5291649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5292069Z op = silu_mul_quant 2025-05-07T20:32:16.5292374Z if compiled: 2025-05-07T20:32:16.5292626Z op = torch.compile(op) 2025-05-07T20:32:16.5292930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5293211Z 2025-05-07T20:32:16.5293418Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5293585Z 2025-05-07T20:32:16.5293686Z moe/activation_test.py:117: 2025-05-07T20:32:16.5293988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5294327Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5294609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5295317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.5296031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5296582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5297285Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5297976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5298537Z kernel = self.compile( 2025-05-07T20:32:16.5299095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5299766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5300176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5300413Z 2025-05-07T20:32:16.5300634Z self = 2025-05-07T20:32:16.5301756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5303180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cf06b6a0>} 2025-05-07T20:32:16.5304694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5305761Z context = 2025-05-07T20:32:16.5306057Z 2025-05-07T20:32:16.5306584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5307128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5307602Z module_map=module_map) 2025-05-07T20:32:16.5307974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5308331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5308605Z E ^ 2025-05-07T20:32:16.5309081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5309552Z 2025-05-07T20:32:16.5309988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5310519Z 2025-05-07T20:32:16.5310628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5311047Z self=, 2025-05-07T20:32:16.5311462Z T=1, 2025-05-07T20:32:16.5311654Z D=7168, 2025-05-07T20:32:16.5311957Z scale_ub=None, 2025-05-07T20:32:16.5312176Z contiguous=True, 2025-05-07T20:32:16.5312404Z compiled=True, 2025-05-07T20:32:16.5312600Z ) 2025-05-07T20:32:16.5312929Z self = 2025-05-07T20:32:16.5313427Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5313760Z 2025-05-07T20:32:16.5313843Z @given( 2025-05-07T20:32:16.5314073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5314394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5314713Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5315041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5315376Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5315669Z ) 2025-05-07T20:32:16.5316017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5316468Z def test_silu_mul_quant( 2025-05-07T20:32:16.5316715Z self, 2025-05-07T20:32:16.5316909Z T: int, 2025-05-07T20:32:16.5317102Z D: int, 2025-05-07T20:32:16.5317322Z scale_ub: Optional[float], 2025-05-07T20:32:16.5317594Z contiguous: bool, 2025-05-07T20:32:16.5317834Z compiled: bool, 2025-05-07T20:32:16.5318059Z ) -> None: 2025-05-07T20:32:16.5318275Z torch.manual_seed(2025) 2025-05-07T20:32:16.5318513Z 2025-05-07T20:32:16.5318793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5319139Z 2025-05-07T20:32:16.5319337Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5319632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5319946Z x = x_sign * x_clamp 2025-05-07T20:32:16.5320182Z x0 = x[:, :D] 2025-05-07T20:32:16.5320399Z x1 = x[:, D:] 2025-05-07T20:32:16.5320608Z 2025-05-07T20:32:16.5320786Z if contiguous: 2025-05-07T20:32:16.5321020Z x0 = x0.contiguous() 2025-05-07T20:32:16.5321280Z x1 = x1.contiguous() 2025-05-07T20:32:16.5321518Z 2025-05-07T20:32:16.5321712Z if scale_ub is not None: 2025-05-07T20:32:16.5321991Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5322330Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5322642Z ) 2025-05-07T20:32:16.5322842Z else: 2025-05-07T20:32:16.5323056Z scale_ub_tensor = None 2025-05-07T20:32:16.5323306Z 2025-05-07T20:32:16.5323675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5324000Z op = silu_mul_quant 2025-05-07T20:32:16.5324252Z if compiled: 2025-05-07T20:32:16.5324504Z op = torch.compile(op) 2025-05-07T20:32:16.5324805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5325084Z 2025-05-07T20:32:16.5325283Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5325573Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5325865Z 2025-05-07T20:32:16.5326107Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5326448Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5326750Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5327068Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5327437Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5327756Z 2025-05-07T20:32:16.5327954Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5328158Z 2025-05-07T20:32:16.5328263Z moe/activation_test.py:126: 2025-05-07T20:32:16.5328566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5328906Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5329237Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5330047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5330877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5331436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5332361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5333432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5334520Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5335591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5336524Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5337389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5338156Z fn() 2025-05-07T20:32:16.5338897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5339749Z self.fn.run( 2025-05-07T20:32:16.5340445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5341239Z kernel = self.compile( 2025-05-07T20:32:16.5342061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5343074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5343661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5344001Z 2025-05-07T20:32:16.5344298Z self = 2025-05-07T20:32:16.5345909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5348042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec65620>} 2025-05-07T20:32:16.5350268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5351902Z context = 2025-05-07T20:32:16.5352378Z 2025-05-07T20:32:16.5352645Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5353484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5354238Z module_map=module_map) 2025-05-07T20:32:16.5354797Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5355365Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5355768Z E ^ 2025-05-07T20:32:16.5356440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5357182Z 2025-05-07T20:32:16.5357925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5358856Z 2025-05-07T20:32:16.5359028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5359717Z self=, 2025-05-07T20:32:16.5360391Z T=4096, 2025-05-07T20:32:16.5360687Z D=5120, 2025-05-07T20:32:16.5360987Z scale_ub=None, 2025-05-07T20:32:16.5361315Z contiguous=False, 2025-05-07T20:32:16.5361666Z compiled=False, 2025-05-07T20:32:16.5362052Z ) 2025-05-07T20:32:16.5362511Z self = 2025-05-07T20:32:16.5363307Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.5363756Z 2025-05-07T20:32:16.5363883Z @given( 2025-05-07T20:32:16.5364231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5364789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5365279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5365811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5366344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5366799Z ) 2025-05-07T20:32:16.5367413Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5368149Z def test_silu_mul_quant( 2025-05-07T20:32:16.5368552Z self, 2025-05-07T20:32:16.5368848Z T: int, 2025-05-07T20:32:16.5369123Z D: int, 2025-05-07T20:32:16.5369435Z scale_ub: Optional[float], 2025-05-07T20:32:16.5369845Z contiguous: bool, 2025-05-07T20:32:16.5370229Z compiled: bool, 2025-05-07T20:32:16.5370587Z ) -> None: 2025-05-07T20:32:16.5370916Z torch.manual_seed(2025) 2025-05-07T20:32:16.5371279Z 2025-05-07T20:32:16.5371723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5372370Z 2025-05-07T20:32:16.5372675Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5373129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5373637Z x = x_sign * x_clamp 2025-05-07T20:32:16.5373998Z x0 = x[:, :D] 2025-05-07T20:32:16.5374328Z x1 = x[:, D:] 2025-05-07T20:32:16.5374651Z 2025-05-07T20:32:16.5374926Z if contiguous: 2025-05-07T20:32:16.5375277Z x0 = x0.contiguous() 2025-05-07T20:32:16.5375671Z x1 = x1.contiguous() 2025-05-07T20:32:16.5376044Z 2025-05-07T20:32:16.5376322Z if scale_ub is not None: 2025-05-07T20:32:16.5376733Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5377261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5377743Z ) 2025-05-07T20:32:16.5378050Z else: 2025-05-07T20:32:16.5378382Z scale_ub_tensor = None 2025-05-07T20:32:16.5378784Z 2025-05-07T20:32:16.5379160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5379673Z op = silu_mul_quant 2025-05-07T20:32:16.5380082Z if compiled: 2025-05-07T20:32:16.5380636Z op = torch.compile(op) 2025-05-07T20:32:16.5381126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5381569Z 2025-05-07T20:32:16.5381860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5382145Z 2025-05-07T20:32:16.5382302Z moe/activation_test.py:117: 2025-05-07T20:32:16.5382796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5383363Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5383854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5385012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5386182Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5387151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5388355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5389603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5390598Z kernel = self.compile( 2025-05-07T20:32:16.5391558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5392721Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5393497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5393898Z 2025-05-07T20:32:16.5394228Z self = 2025-05-07T20:32:16.5395910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5398184Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec665c0>} 2025-05-07T20:32:16.5400267Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5401972Z context = 2025-05-07T20:32:16.5402400Z 2025-05-07T20:32:16.5402633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5403388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5404088Z module_map=module_map) 2025-05-07T20:32:16.5404568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5405051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5405411Z E ^ 2025-05-07T20:32:16.5406108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5406985Z 2025-05-07T20:32:16.5407674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5408479Z 2025-05-07T20:32:16.5408625Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5409233Z self=, 2025-05-07T20:32:16.5409822Z T=4096, 2025-05-07T20:32:16.5410089Z D=7168, 2025-05-07T20:32:16.5410364Z scale_ub=None, 2025-05-07T20:32:16.5410678Z contiguous=False, 2025-05-07T20:32:16.5411011Z compiled=False, 2025-05-07T20:32:16.5411316Z ) 2025-05-07T20:32:16.5411888Z self = 2025-05-07T20:32:16.5423298Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.5423754Z 2025-05-07T20:32:16.5424112Z @given( 2025-05-07T20:32:16.5424449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5424900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5425325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5425796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5426265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5426667Z ) 2025-05-07T20:32:16.5427166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5427810Z def test_silu_mul_quant( 2025-05-07T20:32:16.5428148Z self, 2025-05-07T20:32:16.5428422Z T: int, 2025-05-07T20:32:16.5428698Z D: int, 2025-05-07T20:32:16.5428998Z scale_ub: Optional[float], 2025-05-07T20:32:16.5429385Z contiguous: bool, 2025-05-07T20:32:16.5429725Z compiled: bool, 2025-05-07T20:32:16.5430042Z ) -> None: 2025-05-07T20:32:16.5430345Z torch.manual_seed(2025) 2025-05-07T20:32:16.5430691Z 2025-05-07T20:32:16.5431074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5431557Z 2025-05-07T20:32:16.5431831Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5432236Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5432679Z x = x_sign * x_clamp 2025-05-07T20:32:16.5433119Z x0 = x[:, :D] 2025-05-07T20:32:16.5433420Z x1 = x[:, D:] 2025-05-07T20:32:16.5433707Z 2025-05-07T20:32:16.5433956Z if contiguous: 2025-05-07T20:32:16.5434276Z x0 = x0.contiguous() 2025-05-07T20:32:16.5434638Z x1 = x1.contiguous() 2025-05-07T20:32:16.5434967Z 2025-05-07T20:32:16.5435231Z if scale_ub is not None: 2025-05-07T20:32:16.5435710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5436175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5436613Z ) 2025-05-07T20:32:16.5436887Z else: 2025-05-07T20:32:16.5437169Z scale_ub_tensor = None 2025-05-07T20:32:16.5437525Z 2025-05-07T20:32:16.5437843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5438279Z op = silu_mul_quant 2025-05-07T20:32:16.5438629Z if compiled: 2025-05-07T20:32:16.5438980Z op = torch.compile(op) 2025-05-07T20:32:16.5439388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5439777Z 2025-05-07T20:32:16.5440041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5440272Z 2025-05-07T20:32:16.5440415Z moe/activation_test.py:117: 2025-05-07T20:32:16.5440826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5441308Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5441700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5442701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5443701Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5444483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5445475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5446435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5447209Z kernel = self.compile( 2025-05-07T20:32:16.5447987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5448932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5449506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5449850Z 2025-05-07T20:32:16.5450138Z self = 2025-05-07T20:32:16.5451950Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5453997Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec676a0>} 2025-05-07T20:32:16.5455959Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5457505Z context = 2025-05-07T20:32:16.5457922Z 2025-05-07T20:32:16.5458161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5458918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5459579Z module_map=module_map) 2025-05-07T20:32:16.5460082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5460577Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5460933Z E ^ 2025-05-07T20:32:16.5461600Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5462321Z 
2025-05-07T20:32:16.5462931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.5463679Z 
2025-05-07T20:32:16.5463831Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.5464472Z     self=,
2025-05-07T20:32:16.5465048Z     T=128,
2025-05-07T20:32:16.5465311Z     D=7168,
2025-05-07T20:32:16.5465571Z     scale_ub=None,
2025-05-07T20:32:16.5465876Z     contiguous=False,
2025-05-07T20:32:16.5466189Z     compiled=True,
2025-05-07T20:32:16.5466463Z )
2025-05-07T20:32:16.5466911Z self = 
2025-05-07T20:32:16.5467612Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:16.5467994Z 
2025-05-07T20:32:16.5468108Z     @given(
2025-05-07T20:32:16.5468419Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:16.5468865Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:16.5469300Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:16.5469761Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:16.5470232Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:16.5470642Z     )
2025-05-07T20:32:16.5471140Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:16.5471779Z     def test_silu_mul_quant(
2025-05-07T20:32:16.5472115Z         self,
2025-05-07T20:32:16.5472385Z         T: int,
2025-05-07T20:32:16.5472662Z         D: int,
2025-05-07T20:32:16.5472969Z         scale_ub: Optional[float],
2025-05-07T20:32:16.5473351Z         contiguous: bool,
2025-05-07T20:32:16.5473679Z         compiled: bool,
2025-05-07T20:32:16.5473993Z     ) -> None:
2025-05-07T20:32:16.5474291Z         torch.manual_seed(2025)
2025-05-07T20:32:16.5474624Z 
2025-05-07T20:32:16.5475005Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.5475489Z 
2025-05-07T20:32:16.5475748Z         x_sign = torch.sign(x)
2025-05-07T20:32:16.5476154Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.5476599Z         x = x_sign * x_clamp
2025-05-07T20:32:16.5476931Z         x0 = x[:, :D]
2025-05-07T20:32:16.5477268Z         x1 = x[:, D:]
2025-05-07T20:32:16.5477583Z 
2025-05-07T20:32:16.5477830Z         if contiguous:
2025-05-07T20:32:16.5478156Z             x0 = x0.contiguous()
2025-05-07T20:32:16.5478621Z             x1 = x1.contiguous()
2025-05-07T20:32:16.5478955Z 
2025-05-07T20:32:16.5479222Z         if scale_ub is not None:
2025-05-07T20:32:16.5479602Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:16.5480064Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:16.5480499Z             )
2025-05-07T20:32:16.5480763Z         else:
2025-05-07T20:32:16.5481059Z             scale_ub_tensor = None
2025-05-07T20:32:16.5481412Z 
2025-05-07T20:32:16.5481732Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:16.5482180Z             op = silu_mul_quant
2025-05-07T20:32:16.5482523Z             if compiled:
2025-05-07T20:32:16.5482867Z                 op = torch.compile(op)
2025-05-07T20:32:16.5483283Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5483668Z 
2025-05-07T20:32:16.5483933Z         y_fp8, y_scale = fn()
2025-05-07T20:32:16.5484328Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:16.5484732Z 
2025-05-07T20:32:16.5485067Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:16.5485544Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:16.5485955Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:16.5486427Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:16.5486975Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:16.5487539Z 
2025-05-07T20:32:16.5487837Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:16.5488135Z 
2025-05-07T20:32:16.5488286Z moe/activation_test.py:126: 
2025-05-07T20:32:16.5488730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:16.5489235Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:16.5489803Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:16.5491081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:16.5492392Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:16.5493252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5494323Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5495426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:16.5496575Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:16.5497758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:16.5498761Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:16.5499717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:16.5500525Z     fn()
2025-05-07T20:32:16.5501338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:16.5502251Z     self.fn.run(
2025-05-07T20:32:16.5502970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5503804Z     kernel = self.compile(
2025-05-07T20:32:16.5504658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5505714Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5506572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:16.5506955Z 
2025-05-07T20:32:16.5507281Z self = 
2025-05-07T20:32:16.5509178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.5511443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ceaf39c0>}
2025-05-07T20:32:16.5513595Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:16.5515222Z context = 
2025-05-07T20:32:16.5515682Z 
2025-05-07T20:32:16.5515937Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5516764Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5517493Z                            module_map=module_map)
2025-05-07T20:32:16.5518067Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5518607Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:16.5519010Z E       ^
2025-05-07T20:32:16.5519715Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5520454Z 
2025-05-07T20:32:16.5521123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
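[Note on the failure mode: the job header shows this ran on a linux.g5.4xlarge runner (NVIDIA A10G, compute capability 8.6), and Triton appears to expose the fp8e4nv (e4m3) dtype only on compute capability 8.9 and newer; on older parts it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A minimal sketch of a capability gate that a test like this could use to skip cleanly on unsupported GPUs; the helper name and the (8, 9) threshold are assumptions, not an actual FBGEMM guard:]

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton's fp8e4nv (e4m3) codegen targets SM 8.9+
        # (Ada/Hopper); the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...  # test_silu_mul_quant as listed above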
2025-05-07T20:32:16.5522103Z 
2025-05-07T20:32:16.5522274Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.5522907Z     self=,
2025-05-07T20:32:16.5523530Z     T=128,
2025-05-07T20:32:16.5523910Z     D=7168,
2025-05-07T20:32:16.5524191Z     scale_ub=None,
2025-05-07T20:32:16.5524503Z     contiguous=False,
2025-05-07T20:32:16.5524838Z     compiled=False,
2025-05-07T20:32:16.5525134Z )
2025-05-07T20:32:16.5525630Z self = 
2025-05-07T20:32:16.5526395Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:16.5526812Z 
[The test body repeats verbatim from the listing above and is omitted; this time the failure is raised from fn() rather than ref_fn():]
2025-05-07T20:32:16.5544043Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:16.5544296Z 
2025-05-07T20:32:16.5544437Z moe/activation_test.py:117: 
2025-05-07T20:32:16.5544870Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:16.5545347Z moe/activation_test.py:115: in fn
2025-05-07T20:32:16.5545738Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5546753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:16.5547969Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:16.5548764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5549915Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5550906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5551698Z     kernel = self.compile(
2025-05-07T20:32:16.5552494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5553539Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5554118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[self/options/codegen_fns/module_map locals as above, here with num_stages=3]
2025-05-07T20:32:16.5562656Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5563423Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5564113Z                            module_map=module_map)
2025-05-07T20:32:16.5564627Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5565135Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:16.5565512Z E       ^
2025-05-07T20:32:16.5566195Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5566874Z 
2025-05-07T20:32:16.5567496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
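[Both kernels fail at the same point: _fbgemm_silu_mul_quant (the eager fn() path) and _kernel_quantize_fp8_row (the ref_fn() path) are each rejected while lowering to fp8e4nv, so no parameter combination drawn by Hypothesis can pass on this GPU. A sketch of what a dtype fallback could look like under the same capability assumption; e5m2 trades mantissa bits for range, so this only illustrates the choice Triton's error points at, not fbgemm_gpu's actual behavior:]

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Assumption: e4m3 (Triton's fp8e4nv) needs SM 8.9+; fall back to
        # e5m2 (Triton's fp8e5), which is available on older architectures.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2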
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5566874Z 2025-05-07T20:32:16.5567496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5568273Z 2025-05-07T20:32:16.5568416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5569143Z self=, 2025-05-07T20:32:16.5569743Z T=4096, 2025-05-07T20:32:16.5570006Z D=5120, 2025-05-07T20:32:16.5570285Z scale_ub=1200.0, 2025-05-07T20:32:16.5570604Z contiguous=True, 2025-05-07T20:32:16.5570917Z compiled=False, 2025-05-07T20:32:16.5571211Z ) 2025-05-07T20:32:16.5571672Z self = 2025-05-07T20:32:16.5572513Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.5572923Z 2025-05-07T20:32:16.5573031Z @given( 2025-05-07T20:32:16.5573354Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5573806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5574236Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5574708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5575189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5575594Z ) 2025-05-07T20:32:16.5576974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5577681Z def test_silu_mul_quant( 2025-05-07T20:32:16.5578020Z self, 2025-05-07T20:32:16.5578292Z T: int, 2025-05-07T20:32:16.5578583Z D: int, 2025-05-07T20:32:16.5578880Z scale_ub: Optional[float], 2025-05-07T20:32:16.5579275Z contiguous: bool, 2025-05-07T20:32:16.5579703Z compiled: bool, 2025-05-07T20:32:16.5580029Z ) -> None: 2025-05-07T20:32:16.5580330Z torch.manual_seed(2025) 2025-05-07T20:32:16.5580686Z 2025-05-07T20:32:16.5581079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5581577Z 2025-05-07T20:32:16.5581861Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5582327Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5582762Z x = x_sign * x_clamp 2025-05-07T20:32:16.5583102Z x0 = x[:, :D] 2025-05-07T20:32:16.5583413Z x1 = x[:, D:] 2025-05-07T20:32:16.5583711Z 2025-05-07T20:32:16.5583975Z if contiguous: 2025-05-07T20:32:16.5584305Z x0 = x0.contiguous() 2025-05-07T20:32:16.5584662Z x1 = x1.contiguous() 2025-05-07T20:32:16.5585006Z 2025-05-07T20:32:16.5585282Z if scale_ub is not None: 2025-05-07T20:32:16.5585664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5586147Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5586597Z ) 2025-05-07T20:32:16.5586865Z else: 2025-05-07T20:32:16.5587150Z scale_ub_tensor = None 2025-05-07T20:32:16.5587511Z 2025-05-07T20:32:16.5587844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5588289Z op = silu_mul_quant 2025-05-07T20:32:16.5588643Z if compiled: 2025-05-07T20:32:16.5588991Z op = torch.compile(op) 2025-05-07T20:32:16.5589403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5589797Z 2025-05-07T20:32:16.5590072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5590302Z 2025-05-07T20:32:16.5590441Z moe/activation_test.py:117: 2025-05-07T20:32:16.5590860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5591341Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5591732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5592756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5593784Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5594567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5595578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5596564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5597454Z kernel = self.compile( 2025-05-07T20:32:16.5598245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5599210Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5599785Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5600123Z 2025-05-07T20:32:16.5600418Z self = 2025-05-07T20:32:16.5601591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5602358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d49fb380>} 2025-05-07T20:32:16.5603468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5603743Z context = 2025-05-07T20:32:16.5603759Z 2025-05-07T20:32:16.5603993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5604446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5604602Z module_map=module_map) 2025-05-07T20:32:16.5604823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5605020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5605136Z E ^ 2025-05-07T20:32:16.5605665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5605678Z 2025-05-07T20:32:16.5606544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5606553Z 2025-05-07T20:32:16.5606701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5607019Z self=, 2025-05-07T20:32:16.5607141Z T=1, 2025-05-07T20:32:16.5607263Z D=5120, 2025-05-07T20:32:16.5607395Z scale_ub=None, 2025-05-07T20:32:16.5607548Z contiguous=True, 2025-05-07T20:32:16.5607663Z compiled=True, 2025-05-07T20:32:16.5607762Z ) 2025-05-07T20:32:16.5608089Z self = 2025-05-07T20:32:16.5608328Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5608339Z 2025-05-07T20:32:16.5608452Z @given( 2025-05-07T20:32:16.5608620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5608766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5608934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5609103Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5609264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5609377Z ) 2025-05-07T20:32:16.5609744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5609883Z def test_silu_mul_quant( 2025-05-07T20:32:16.5609989Z self, 2025-05-07T20:32:16.5610098Z T: int, 2025-05-07T20:32:16.5610214Z D: int, 2025-05-07T20:32:16.5610347Z scale_ub: Optional[float], 2025-05-07T20:32:16.5610470Z contiguous: bool, 2025-05-07T20:32:16.5610606Z compiled: bool, 2025-05-07T20:32:16.5610723Z ) -> None: 2025-05-07T20:32:16.5610854Z torch.manual_seed(2025) 2025-05-07T20:32:16.5610967Z 2025-05-07T20:32:16.5611207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5611559Z 2025-05-07T20:32:16.5611699Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5611978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5612104Z x = x_sign * x_clamp 2025-05-07T20:32:16.5612227Z x0 = x[:, :D] 2025-05-07T20:32:16.5612339Z x1 = x[:, D:] 2025-05-07T20:32:16.5612449Z 2025-05-07T20:32:16.5612569Z if contiguous: 2025-05-07T20:32:16.5612701Z x0 = x0.contiguous() 2025-05-07T20:32:16.5612834Z x1 = x1.contiguous() 2025-05-07T20:32:16.5612935Z 2025-05-07T20:32:16.5613061Z if scale_ub is not None: 2025-05-07T20:32:16.5613211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5613398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5613510Z ) 2025-05-07T20:32:16.5613629Z else: 2025-05-07T20:32:16.5613758Z scale_ub_tensor = None 2025-05-07T20:32:16.5613863Z 2025-05-07T20:32:16.5614062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5614185Z op = silu_mul_quant 2025-05-07T20:32:16.5614312Z if compiled: 2025-05-07T20:32:16.5614449Z op = torch.compile(op) 2025-05-07T20:32:16.5614596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5614708Z 2025-05-07T20:32:16.5614832Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5615090Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5615201Z 2025-05-07T20:32:16.5615394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5615533Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5615679Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5615938Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5616136Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5616245Z 2025-05-07T20:32:16.5616389Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5616396Z 2025-05-07T20:32:16.5616541Z moe/activation_test.py:126: 2025-05-07T20:32:16.5616724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5616875Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5617069Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5617898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5618045Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5618587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5618914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5619470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5619847Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5620403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5620645Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5621153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5621273Z fn() 2025-05-07T20:32:16.5621864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5621979Z self.fn.run( 2025-05-07T20:32:16.5622488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5622624Z kernel = self.compile( 2025-05-07T20:32:16.5623554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5623823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5624003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5624011Z 2025-05-07T20:32:16.5624311Z self = 2025-05-07T20:32:16.5625471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5626224Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ce7c0720>} 2025-05-07T20:32:16.5627373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5627653Z context = 2025-05-07T20:32:16.5627661Z 2025-05-07T20:32:16.5627903Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5628286Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5628537Z module_map=module_map) 2025-05-07T20:32:16.5628758Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5628904Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5629021Z E ^ 2025-05-07T20:32:16.5629536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5629602Z 2025-05-07T20:32:16.5630236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5630248Z 2025-05-07T20:32:16.5630407Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5630730Z self=, 2025-05-07T20:32:16.5647153Z T=2048, 2025-05-07T20:32:16.5647346Z D=5120, 2025-05-07T20:32:16.5647541Z scale_ub=None, 2025-05-07T20:32:16.5647693Z contiguous=True, 2025-05-07T20:32:16.5647895Z compiled=True, 2025-05-07T20:32:16.5648045Z ) 2025-05-07T20:32:16.5648485Z self = 2025-05-07T20:32:16.5648776Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5648784Z 2025-05-07T20:32:16.5648906Z @given( 2025-05-07T20:32:16.5649086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5649239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5649399Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5649582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5649757Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5649865Z ) 2025-05-07T20:32:16.5650226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5650374Z def test_silu_mul_quant( 2025-05-07T20:32:16.5650483Z self, 2025-05-07T20:32:16.5650593Z T: int, 2025-05-07T20:32:16.5650718Z D: int, 2025-05-07T20:32:16.5650871Z scale_ub: Optional[float], 2025-05-07T20:32:16.5651008Z contiguous: bool, 2025-05-07T20:32:16.5651139Z compiled: bool, 2025-05-07T20:32:16.5651259Z ) -> None: 2025-05-07T20:32:16.5651405Z torch.manual_seed(2025) 2025-05-07T20:32:16.5651511Z 2025-05-07T20:32:16.5651878Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5652000Z 2025-05-07T20:32:16.5652131Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5652573Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5652724Z x = x_sign * x_clamp 2025-05-07T20:32:16.5652881Z x0 = x[:, :D] 2025-05-07T20:32:16.5652997Z x1 = x[:, D:] 2025-05-07T20:32:16.5653120Z 2025-05-07T20:32:16.5653279Z if contiguous: 2025-05-07T20:32:16.5653413Z x0 = x0.contiguous() 2025-05-07T20:32:16.5653542Z x1 = x1.contiguous() 2025-05-07T20:32:16.5653662Z 2025-05-07T20:32:16.5653791Z if scale_ub is not None: 2025-05-07T20:32:16.5653936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5654129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5654239Z ) 2025-05-07T20:32:16.5654345Z else: 2025-05-07T20:32:16.5654487Z scale_ub_tensor = None 2025-05-07T20:32:16.5654597Z 2025-05-07T20:32:16.5654790Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5654914Z op = silu_mul_quant 2025-05-07T20:32:16.5655030Z if compiled: 2025-05-07T20:32:16.5655191Z op = torch.compile(op) 2025-05-07T20:32:16.5655336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5655439Z 2025-05-07T20:32:16.5655578Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5655754Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5655860Z 2025-05-07T20:32:16.5656064Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5656289Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5656432Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5656618Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5656816Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5656991Z 2025-05-07T20:32:16.5657138Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5657146Z 2025-05-07T20:32:16.5657292Z moe/activation_test.py:126: 2025-05-07T20:32:16.5657498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5657650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5657845Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5658700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5658849Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5659400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5659733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5660281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5660673Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5661235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5661489Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5662001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5662114Z fn() 2025-05-07T20:32:16.5662733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5662858Z self.fn.run( 2025-05-07T20:32:16.5663410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5663556Z kernel = self.compile( 2025-05-07T20:32:16.5664173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5664479Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5664867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5664877Z 2025-05-07T20:32:16.5665277Z self = 2025-05-07T20:32:16.5670893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5671707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cf3f05e0>} 2025-05-07T20:32:16.5672815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5673098Z context = 2025-05-07T20:32:16.5673106Z 2025-05-07T20:32:16.5673343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5673727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5673874Z module_map=module_map) 2025-05-07T20:32:16.5674101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5674375Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5674481Z E ^ 2025-05-07T20:32:16.5675041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5675049Z 2025-05-07T20:32:16.5675656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5675722Z 2025-05-07T20:32:16.5675867Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5676189Z self=, 2025-05-07T20:32:16.5676295Z T=128, 2025-05-07T20:32:16.5676404Z D=5120, 2025-05-07T20:32:16.5676517Z scale_ub=None, 2025-05-07T20:32:16.5697461Z contiguous=True, 2025-05-07T20:32:16.5697621Z compiled=True, 2025-05-07T20:32:16.5697727Z ) 2025-05-07T20:32:16.5698065Z self = 2025-05-07T20:32:16.5698313Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5698320Z 2025-05-07T20:32:16.5698429Z @given( 2025-05-07T20:32:16.5698598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5698741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5698900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5699078Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5699237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5699343Z ) 2025-05-07T20:32:16.5699718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5699851Z def test_silu_mul_quant( 2025-05-07T20:32:16.5699973Z self, 2025-05-07T20:32:16.5700079Z T: int, 2025-05-07T20:32:16.5700186Z D: int, 2025-05-07T20:32:16.5700325Z scale_ub: Optional[float], 2025-05-07T20:32:16.5700452Z contiguous: bool, 2025-05-07T20:32:16.5700573Z compiled: bool, 2025-05-07T20:32:16.5700693Z ) -> None: 2025-05-07T20:32:16.5700829Z torch.manual_seed(2025) 2025-05-07T20:32:16.5700927Z 2025-05-07T20:32:16.5701164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5701276Z 2025-05-07T20:32:16.5701401Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5701582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5701712Z x = x_sign * x_clamp 2025-05-07T20:32:16.5701824Z x0 = x[:, :D] 2025-05-07T20:32:16.5702110Z x1 = x[:, D:] 2025-05-07T20:32:16.5702225Z 2025-05-07T20:32:16.5702345Z if contiguous: 2025-05-07T20:32:16.5702477Z x0 = x0.contiguous() 2025-05-07T20:32:16.5702612Z x1 = x1.contiguous() 2025-05-07T20:32:16.5702718Z 2025-05-07T20:32:16.5702855Z if scale_ub is not None: 2025-05-07T20:32:16.5703003Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5703198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5703313Z ) 2025-05-07T20:32:16.5703423Z else: 2025-05-07T20:32:16.5703558Z scale_ub_tensor = None 2025-05-07T20:32:16.5703671Z 2025-05-07T20:32:16.5703854Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5703983Z op = silu_mul_quant 2025-05-07T20:32:16.5704108Z if compiled: 2025-05-07T20:32:16.5704246Z op = torch.compile(op) 2025-05-07T20:32:16.5704395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5704511Z 2025-05-07T20:32:16.5704636Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5704820Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5704922Z 2025-05-07T20:32:16.5705117Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5705273Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5705415Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5705661Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5705872Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5705985Z 2025-05-07T20:32:16.5706421Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5706430Z 2025-05-07T20:32:16.5706804Z moe/activation_test.py:126: 2025-05-07T20:32:16.5707018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5707200Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5707404Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5708236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5708387Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5708917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5709243Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5709791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5710163Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5710728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5710968Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5711478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5711600Z fn() 2025-05-07T20:32:16.5712194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5712314Z self.fn.run( 2025-05-07T20:32:16.5712812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5712946Z kernel = self.compile( 2025-05-07T20:32:16.5713533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5713785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5713973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5713979Z 2025-05-07T20:32:16.5714487Z self = 2025-05-07T20:32:16.5715675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5716440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a92520>} 2025-05-07T20:32:16.5717596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5717877Z context = 2025-05-07T20:32:16.5717883Z 2025-05-07T20:32:16.5718116Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5718513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5718668Z module_map=module_map) 2025-05-07T20:32:16.5718893Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5719034Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5719151Z E ^ 2025-05-07T20:32:16.5719665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5719757Z 2025-05-07T20:32:16.5720377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5720384Z 2025-05-07T20:32:16.5720530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5720904Z self=, 2025-05-07T20:32:16.5721023Z T=4096, 2025-05-07T20:32:16.5721131Z D=5120, 2025-05-07T20:32:16.5721262Z scale_ub=None, 2025-05-07T20:32:16.5721385Z contiguous=True, 2025-05-07T20:32:16.5721501Z compiled=True, 2025-05-07T20:32:16.5721612Z ) 2025-05-07T20:32:16.5721929Z self = 2025-05-07T20:32:16.5722174Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5722181Z 2025-05-07T20:32:16.5722303Z @given( 2025-05-07T20:32:16.5722466Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5722598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5722762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5722929Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5723104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5723210Z ) 2025-05-07T20:32:16.5723566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5723704Z def test_silu_mul_quant( 2025-05-07T20:32:16.5723817Z self, 2025-05-07T20:32:16.5723932Z T: int, 2025-05-07T20:32:16.5724050Z D: int, 2025-05-07T20:32:16.5724189Z scale_ub: Optional[float], 2025-05-07T20:32:16.5724315Z contiguous: bool, 2025-05-07T20:32:16.5724451Z compiled: bool, 2025-05-07T20:32:16.5724564Z ) -> None: 2025-05-07T20:32:16.5724694Z torch.manual_seed(2025) 2025-05-07T20:32:16.5724807Z 2025-05-07T20:32:16.5725042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5725151Z 2025-05-07T20:32:16.5725292Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5725465Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5725596Z x = x_sign * x_clamp 2025-05-07T20:32:16.5725715Z x0 = x[:, :D] 2025-05-07T20:32:16.5725829Z x1 = x[:, D:] 2025-05-07T20:32:16.5725940Z 2025-05-07T20:32:16.5726063Z if contiguous: 2025-05-07T20:32:16.5726196Z x0 = x0.contiguous() 2025-05-07T20:32:16.5726444Z x1 = x1.contiguous() 2025-05-07T20:32:16.5726554Z 2025-05-07T20:32:16.5726681Z if scale_ub is not None: 2025-05-07T20:32:16.5726835Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5727028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5727156Z ) 2025-05-07T20:32:16.5727280Z else: 2025-05-07T20:32:16.5727438Z scale_ub_tensor = None 2025-05-07T20:32:16.5727553Z 2025-05-07T20:32:16.5727731Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5727853Z op = silu_mul_quant 2025-05-07T20:32:16.5727977Z if compiled: 2025-05-07T20:32:16.5728114Z op = torch.compile(op) 2025-05-07T20:32:16.5728265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5728377Z 2025-05-07T20:32:16.5728504Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5728679Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5728787Z 2025-05-07T20:32:16.5728975Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5729122Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5729270Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5729447Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5729653Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5729818Z 2025-05-07T20:32:16.5729959Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5729965Z 2025-05-07T20:32:16.5730108Z moe/activation_test.py:126: 2025-05-07T20:32:16.5730290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5730494Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5730694Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5731529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5731679Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5732292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5732616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5733174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5733540Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5734102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5734342Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5734856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5734972Z fn() 2025-05-07T20:32:16.5735563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5735680Z self.fn.run( 2025-05-07T20:32:16.5736180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5736310Z kernel = self.compile( 2025-05-07T20:32:16.5736875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5737126Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5737314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5737324Z 2025-05-07T20:32:16.5737655Z self = 2025-05-07T20:32:16.5738948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5739712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a55300>} 2025-05-07T20:32:16.5740823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5741110Z context = 2025-05-07T20:32:16.5741117Z 2025-05-07T20:32:16.5741367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5741765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5741931Z module_map=module_map) 2025-05-07T20:32:16.5742162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5742308Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5742423Z E ^ 2025-05-07T20:32:16.5742954Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5742961Z 2025-05-07T20:32:16.5743653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5743668Z 2025-05-07T20:32:16.5743818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5744138Z self=, 2025-05-07T20:32:16.5744323Z T=16384, 2025-05-07T20:32:16.5744430Z D=5120, 2025-05-07T20:32:16.5744543Z scale_ub=None, 2025-05-07T20:32:16.5744670Z contiguous=True, 2025-05-07T20:32:16.5744790Z compiled=True, 2025-05-07T20:32:16.5744892Z ) 2025-05-07T20:32:16.5745231Z self = 2025-05-07T20:32:16.5745478Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5745484Z 2025-05-07T20:32:16.5745599Z @given( 2025-05-07T20:32:16.5745762Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5745907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5746083Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5746248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5746405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5746518Z ) 2025-05-07T20:32:16.5746875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5747012Z def test_silu_mul_quant( 2025-05-07T20:32:16.5747139Z self, 2025-05-07T20:32:16.5747267Z T: int, 2025-05-07T20:32:16.5747405Z D: int, 2025-05-07T20:32:16.5747548Z scale_ub: Optional[float], 2025-05-07T20:32:16.5747676Z contiguous: bool, 2025-05-07T20:32:16.5747801Z compiled: bool, 2025-05-07T20:32:16.5747912Z ) -> None: 2025-05-07T20:32:16.5748046Z torch.manual_seed(2025) 2025-05-07T20:32:16.5748154Z 2025-05-07T20:32:16.5748379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5748486Z 2025-05-07T20:32:16.5748622Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5748794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5748915Z x = x_sign * x_clamp 2025-05-07T20:32:16.5749032Z x0 = x[:, :D] 2025-05-07T20:32:16.5749150Z x1 = x[:, D:] 2025-05-07T20:32:16.5749256Z 2025-05-07T20:32:16.5749385Z if contiguous: 2025-05-07T20:32:16.5749508Z x0 = x0.contiguous() 2025-05-07T20:32:16.5749638Z x1 = x1.contiguous() 2025-05-07T20:32:16.5749742Z 2025-05-07T20:32:16.5749974Z if scale_ub is not None: 2025-05-07T20:32:16.5750134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5750327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5750436Z ) 2025-05-07T20:32:16.5750555Z else: 2025-05-07T20:32:16.5750685Z scale_ub_tensor = None 2025-05-07T20:32:16.5750785Z 2025-05-07T20:32:16.5750973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5751104Z op = silu_mul_quant 2025-05-07T20:32:16.5751225Z if compiled: 2025-05-07T20:32:16.5751368Z op = torch.compile(op) 2025-05-07T20:32:16.5751508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5751612Z 2025-05-07T20:32:16.5751733Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5751896Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5751997Z 2025-05-07T20:32:16.5752177Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5752326Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5752471Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5752639Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5752828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5752933Z 2025-05-07T20:32:16.5753069Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5753145Z 2025-05-07T20:32:16.5753282Z moe/activation_test.py:126: 2025-05-07T20:32:16.5753454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5753593Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5753788Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5754714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5754858Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5755431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5755768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5756342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5756733Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5757324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5757583Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5758102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5758215Z fn() 2025-05-07T20:32:16.5758854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5758975Z self.fn.run( 2025-05-07T20:32:16.5759520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5759659Z kernel = self.compile( 2025-05-07T20:32:16.5760216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5760464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5760642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5760650Z 2025-05-07T20:32:16.5760948Z self = 2025-05-07T20:32:16.5762335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5763144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96391f8e00>} 2025-05-07T20:32:16.5764332Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5764624Z context = 2025-05-07T20:32:16.5764632Z 2025-05-07T20:32:16.5764875Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5765279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5765436Z module_map=module_map) 2025-05-07T20:32:16.5765668Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5765822Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5765936Z E ^ 2025-05-07T20:32:16.5766472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5766481Z 2025-05-07T20:32:16.5767144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5767230Z 2025-05-07T20:32:16.5767415Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5767758Z self=, 2025-05-07T20:32:16.5767880Z T=1, 2025-05-07T20:32:16.5767990Z D=5120, 2025-05-07T20:32:16.5768106Z scale_ub=1200.0, 2025-05-07T20:32:16.5768230Z contiguous=True, 2025-05-07T20:32:16.5768407Z compiled=True, 2025-05-07T20:32:16.5768505Z ) 2025-05-07T20:32:16.5768827Z self = 2025-05-07T20:32:16.5769073Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.5769080Z 2025-05-07T20:32:16.5769191Z @given( 2025-05-07T20:32:16.5769366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5769509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5769677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5769852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5770017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5770130Z ) 2025-05-07T20:32:16.5770507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5770651Z def test_silu_mul_quant( 2025-05-07T20:32:16.5770766Z self, 2025-05-07T20:32:16.5770875Z T: int, 2025-05-07T20:32:16.5770985Z D: int, 2025-05-07T20:32:16.5771131Z scale_ub: Optional[float], 2025-05-07T20:32:16.5771256Z contiguous: bool, 2025-05-07T20:32:16.5771374Z compiled: bool, 2025-05-07T20:32:16.5771508Z ) -> None: 2025-05-07T20:32:16.5771645Z torch.manual_seed(2025) 2025-05-07T20:32:16.5771749Z 2025-05-07T20:32:16.5772157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5772265Z 2025-05-07T20:32:16.5772404Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5772581Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5772713Z x = x_sign * x_clamp 2025-05-07T20:32:16.5772834Z x0 = x[:, :D] 2025-05-07T20:32:16.5772945Z x1 = x[:, D:] 2025-05-07T20:32:16.5773042Z 2025-05-07T20:32:16.5773168Z if contiguous: 2025-05-07T20:32:16.5773294Z x0 = x0.contiguous() 2025-05-07T20:32:16.5773414Z x1 = x1.contiguous() 2025-05-07T20:32:16.5773527Z 2025-05-07T20:32:16.5773652Z if scale_ub is not None: 2025-05-07T20:32:16.5773794Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5774097Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5774206Z ) 2025-05-07T20:32:16.5774318Z else: 2025-05-07T20:32:16.5774447Z scale_ub_tensor = None 2025-05-07T20:32:16.5774544Z 2025-05-07T20:32:16.5774733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5774856Z op = silu_mul_quant 2025-05-07T20:32:16.5774971Z if compiled: 2025-05-07T20:32:16.5775121Z op = torch.compile(op) 2025-05-07T20:32:16.5775268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5775370Z 2025-05-07T20:32:16.5775498Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5775504Z 2025-05-07T20:32:16.5775640Z moe/activation_test.py:117: 2025-05-07T20:32:16.5775831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5775980Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5776117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5776725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.5776862Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.5777689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5777837Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5778385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5778802Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5779400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5779701Z kernel = self.compile( 2025-05-07T20:32:16.5780311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5780571Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5780763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5780771Z 2025-05-07T20:32:16.5781067Z self = 2025-05-07T20:32:16.5782295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5783123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96394cccc0>} 2025-05-07T20:32:16.5784311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5784619Z context = 2025-05-07T20:32:16.5784627Z 2025-05-07T20:32:16.5784878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5785306Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5785465Z module_map=module_map) 2025-05-07T20:32:16.5785708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5785865Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5785977Z E ^ 2025-05-07T20:32:16.5786529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5786552Z 2025-05-07T20:32:16.5787190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5787198Z 2025-05-07T20:32:16.5787351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5787793Z self=, 2025-05-07T20:32:16.5787910Z T=1, 2025-05-07T20:32:16.5788016Z D=5120, 2025-05-07T20:32:16.5788137Z scale_ub=None, 2025-05-07T20:32:16.5788259Z contiguous=False, 2025-05-07T20:32:16.5788379Z compiled=True, 2025-05-07T20:32:16.5788489Z ) 2025-05-07T20:32:16.5788813Z self = 2025-05-07T20:32:16.5789066Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.5789074Z 2025-05-07T20:32:16.5789180Z @given( 2025-05-07T20:32:16.5789346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5789494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5789664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5789831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5790012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5790120Z ) 2025-05-07T20:32:16.5790498Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5790641Z def test_silu_mul_quant( 2025-05-07T20:32:16.5790749Z self, 2025-05-07T20:32:16.5790865Z T: int, 2025-05-07T20:32:16.5790973Z D: int, 2025-05-07T20:32:16.5791108Z scale_ub: Optional[float], 2025-05-07T20:32:16.5791311Z contiguous: bool, 2025-05-07T20:32:16.5791434Z compiled: bool, 2025-05-07T20:32:16.5791544Z ) -> None: 2025-05-07T20:32:16.5791683Z torch.manual_seed(2025) 2025-05-07T20:32:16.5791788Z 2025-05-07T20:32:16.5792040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5792209Z 2025-05-07T20:32:16.5792343Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5792519Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5792654Z x = x_sign * x_clamp 2025-05-07T20:32:16.5792774Z x0 = x[:, :D] 2025-05-07T20:32:16.5792883Z x1 = x[:, D:] 2025-05-07T20:32:16.5792992Z 2025-05-07T20:32:16.5793110Z if contiguous: 2025-05-07T20:32:16.5793244Z x0 = x0.contiguous() 2025-05-07T20:32:16.5793368Z x1 = x1.contiguous() 2025-05-07T20:32:16.5793473Z 2025-05-07T20:32:16.5793607Z if scale_ub is not None: 2025-05-07T20:32:16.5793756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5793956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5794072Z ) 2025-05-07T20:32:16.5794179Z else: 2025-05-07T20:32:16.5794311Z scale_ub_tensor = None 2025-05-07T20:32:16.5794424Z 2025-05-07T20:32:16.5794607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5794739Z op = silu_mul_quant 2025-05-07T20:32:16.5794861Z if compiled: 2025-05-07T20:32:16.5795002Z op = torch.compile(op) 2025-05-07T20:32:16.5795165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5795272Z 2025-05-07T20:32:16.5795399Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5795578Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5795681Z 2025-05-07T20:32:16.5795876Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5796025Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5796169Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5796343Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5796561Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5796664Z 2025-05-07T20:32:16.5796817Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5796828Z 2025-05-07T20:32:16.5796967Z moe/activation_test.py:126: 2025-05-07T20:32:16.5797175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5797464Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5797666Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5798561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5798719Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5799278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5799610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5800188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5800584Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5801198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5801457Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5801999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5802112Z fn() 2025-05-07T20:32:16.5802730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5802936Z self.fn.run( 2025-05-07T20:32:16.5803465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5803608Z kernel = self.compile( 2025-05-07T20:32:16.5804213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5804573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5804773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5804788Z 2025-05-07T20:32:16.5805090Z self = 2025-05-07T20:32:16.5806945Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5807629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96394cf2e0>} 2025-05-07T20:32:16.5808418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5808628Z context = 2025-05-07T20:32:16.5808634Z 2025-05-07T20:32:16.5808814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5809089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5809202Z module_map=module_map) 2025-05-07T20:32:16.5809368Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5809479Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5809558Z E ^ 2025-05-07T20:32:16.5809928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:16.5810368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis retries test_silu_mul_quant with fresh parameter draws from here on. The test listing and the traceback repeat verbatim for every draw, so only the drawn parameters and the failing call are kept below; notes on the root cause and possible mitigations are interleaved between the groups of examples.]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fn() fails at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant: same CompilationError
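[Note on the root cause: fp8e4nv is Triton's name for torch.float8_e4m3fn, and Triton only lowers that type on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose A10G reports sm_86, so any kernel that touches an fp8e4nv value fails inside src.make_ir exactly as shown above. A minimal sketch of a capability probe one could run before exercising these kernels; the helper name is illustrative, not part of FBGEMM:]

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) only compiles on compute capability
    # >= (8, 9), e.g. L4/L40S/H100; an A10G (sm_86) is limited to the
    # fp8e5 / fp8e4b15 variants named in the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)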
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fn() fails under torch.compile (torch/_dynamo/eval_frame.py:678), compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> fn() fails compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fn() fails compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fn() fails compiling _fbgemm_silu_mul_quant: same CompilationError
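[The failure is reproducible outside the test suite with any Triton kernel that converts to fp8e4nv. A sketch, assuming a CUDA device and the Triton version shown in the tracebacks; the kernel below is illustrative, not FBGEMM code:]

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_probe(x_ptr, y_ptr):
    # the conversion to tl.float8e4nv is what trips the architecture
    # check inside src.make_ir during JIT compilation
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))

x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
# on sm_86 this raises the same CompilationError recorded in this log;
# on sm_89+ it compiles and runs
_fp8_cast_probe[(1,)](x, y)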
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fn() fails under torch.compile, compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> fn() fails under torch.compile, compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> fn() itself passes; the reference path ref_fn() fails at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row (fp8_gemm.py:2370): same CompilationError
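[Every draw fails the same way, so letting Hypothesis walk through its remaining examples only spends runner time. A common mitigation is to gate the test on hardware support; a sketch under the assumption that the suite is unittest-based (class name and message are placeholders):]

import unittest

import torch

_HAS_FP8E4NV = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@unittest.skipIf(not _HAS_FP8E4NV, "fp8e4nv kernels require sm_89 or newer")
class SiluMulQuantTests(unittest.TestCase):
    # test_silu_mul_quant would live here unchanged
    ...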
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5930196Z 2025-05-07T20:32:16.5930638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5930642Z 2025-05-07T20:32:16.5930746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5930985Z self=, 2025-05-07T20:32:16.5931065Z T=1, 2025-05-07T20:32:16.5931144Z D=5120, 2025-05-07T20:32:16.5931236Z scale_ub=1200.0, 2025-05-07T20:32:16.5931325Z contiguous=False, 2025-05-07T20:32:16.5931408Z compiled=True, 2025-05-07T20:32:16.5931489Z ) 2025-05-07T20:32:16.5931722Z self = 2025-05-07T20:32:16.5932001Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.5932006Z 2025-05-07T20:32:16.5932085Z @given( 2025-05-07T20:32:16.5932207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5932312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5932474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5932593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5932713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5932789Z ) 2025-05-07T20:32:16.5933051Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5933190Z def test_silu_mul_quant( 2025-05-07T20:32:16.5933266Z self, 2025-05-07T20:32:16.5933348Z T: int, 2025-05-07T20:32:16.5933424Z D: int, 2025-05-07T20:32:16.5933529Z scale_ub: Optional[float], 2025-05-07T20:32:16.5933624Z contiguous: bool, 2025-05-07T20:32:16.5933710Z compiled: bool, 2025-05-07T20:32:16.5933788Z ) -> None: 2025-05-07T20:32:16.5933889Z torch.manual_seed(2025) 2025-05-07T20:32:16.5933961Z 2025-05-07T20:32:16.5934136Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5934216Z 2025-05-07T20:32:16.5934313Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5934438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5934533Z x = x_sign * x_clamp 2025-05-07T20:32:16.5934612Z x0 = x[:, :D] 2025-05-07T20:32:16.5934698Z x1 = x[:, D:] 2025-05-07T20:32:16.5934771Z 2025-05-07T20:32:16.5934858Z if contiguous: 2025-05-07T20:32:16.5934957Z x0 = x0.contiguous() 2025-05-07T20:32:16.5935047Z x1 = x1.contiguous() 2025-05-07T20:32:16.5935120Z 2025-05-07T20:32:16.5935219Z if scale_ub is not None: 2025-05-07T20:32:16.5935328Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5935467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5935548Z ) 2025-05-07T20:32:16.5935623Z else: 2025-05-07T20:32:16.5935718Z scale_ub_tensor = None 2025-05-07T20:32:16.5935796Z 2025-05-07T20:32:16.5935929Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5936028Z op = silu_mul_quant 2025-05-07T20:32:16.5936113Z if compiled: 2025-05-07T20:32:16.5936213Z op = torch.compile(op) 2025-05-07T20:32:16.5936328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5936400Z 2025-05-07T20:32:16.5936490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5936497Z 2025-05-07T20:32:16.5936600Z moe/activation_test.py:117: 2025-05-07T20:32:16.5936733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5936916Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5937036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5937462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.5937561Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.5938083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:16.5938183Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:16.5938565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5938797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5939160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5939263Z kernel = self.compile(
2025-05-07T20:32:16.5939670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5939856Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5939989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:16.5939994Z
2025-05-07T20:32:16.5940206Z self =
2025-05-07T20:32:16.5941113Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.5941647Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96385aff60>}
2025-05-07T20:32:16.5942494Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:16.5942693Z context =
2025-05-07T20:32:16.5942698Z
2025-05-07T20:32:16.5942876Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5943153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5943263Z module_map=module_map)
2025-05-07T20:32:16.5943435Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5943537Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:16.5943616Z E ^
2025-05-07T20:32:16.5944000Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5944005Z
2025-05-07T20:32:16.5944450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.5944454Z
2025-05-07T20:32:16.5944564Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.5944798Z self=,
2025-05-07T20:32:16.5944877Z T=1,
2025-05-07T20:32:16.5944963Z D=5120,
2025-05-07T20:32:16.5945047Z scale_ub=1200.0,
2025-05-07T20:32:16.5945135Z contiguous=False,
2025-05-07T20:32:16.5945227Z compiled=False,
2025-05-07T20:32:16.5945302Z )
2025-05-07T20:32:16.5945531Z self =
2025-05-07T20:32:16.5945716Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:16.5945721Z
2025-05-07T20:32:16.5945803Z @given(
2025-05-07T20:32:16.5945930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:16.5946031Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:16.5946228Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:16.5946355Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:16.5946471Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:16.5946546Z )
2025-05-07T20:32:16.5946811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:16.5946904Z def test_silu_mul_quant(
2025-05-07T20:32:16.5946987Z self,
2025-05-07T20:32:16.5947066Z T: int,
2025-05-07T20:32:16.5947143Z D: int,
2025-05-07T20:32:16.5947248Z scale_ub: Optional[float],
2025-05-07T20:32:16.5947337Z contiguous: bool,
2025-05-07T20:32:16.5947422Z compiled: bool,
2025-05-07T20:32:16.5947506Z ) -> None:
2025-05-07T20:32:16.5947601Z torch.manual_seed(2025)
2025-05-07T20:32:16.5947677Z
2025-05-07T20:32:16.5947853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.5947927Z
2025-05-07T20:32:16.5948022Z x_sign = torch.sign(x)
2025-05-07T20:32:16.5948158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.5948245Z x = x_sign * x_clamp
2025-05-07T20:32:16.5948326Z x0 = x[:, :D]
2025-05-07T20:32:16.5948411Z x1 = x[:, D:]
2025-05-07T20:32:16.5948483Z
2025-05-07T20:32:16.5948573Z if contiguous:
2025-05-07T20:32:16.5948664Z x0 = x0.contiguous()
2025-05-07T20:32:16.5948797Z x1 = x1.contiguous()
2025-05-07T20:32:16.5948874Z
2025-05-07T20:32:16.5948963Z if scale_ub is not None:
2025-05-07T20:32:16.5949068Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:16.5949211Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:16.5949289Z )
2025-05-07T20:32:16.5949407Z else:
2025-05-07T20:32:16.5949509Z scale_ub_tensor = None
2025-05-07T20:32:16.5949583Z
2025-05-07T20:32:16.5949715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:16.5949816Z op = silu_mul_quant
2025-05-07T20:32:16.5949900Z if compiled:
2025-05-07T20:32:16.5950006Z op = torch.compile(op)
2025-05-07T20:32:16.5950112Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5950184Z
2025-05-07T20:32:16.5950281Z > y_fp8, y_scale = fn()
2025-05-07T20:32:16.5950285Z
2025-05-07T20:32:16.5950384Z moe/activation_test.py:117:
2025-05-07T20:32:16.5950519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:16.5950627Z moe/activation_test.py:115: in fn
2025-05-07T20:32:16.5950728Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5951259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:16.5951367Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:16.5951746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5951988Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5952349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5952443Z kernel = self.compile(
2025-05-07T20:32:16.5952854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5953036Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5953174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:16.5953178Z
2025-05-07T20:32:16.5953390Z self =
2025-05-07T20:32:16.5954298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.5954836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f963966e980>}
2025-05-07T20:32:16.5955629Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:16.5955835Z context =
2025-05-07T20:32:16.5955839Z
2025-05-07T20:32:16.5956009Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5956285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5956401Z module_map=module_map)
2025-05-07T20:32:16.5956566Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5956679Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:16.5956756Z E ^
2025-05-07T20:32:16.5957131Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5957580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.5957627Z
[Hypothesis then tried ten more examples, and every one failed at the same point with the identical Triton CompilationError. The repeated test-source listings and tracebacks, verbatim copies of the ones shown above, are omitted here; only the sampled parameters differed:
  T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=1, D=5120, scale_ub=None, contiguous=False, compiled=False
  T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True
  T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True
  T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=1, D=7168, scale_ub=None, contiguous=True, compiled=False
The final example tried in this chunk follows with its full traceback.]
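[Note on the failure mode: the exception is raised while Triton compiles the _fbgemm_silu_mul_quant kernel, before any numerics run, which is why every sampled parameter combination fails identically. Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer; on older architectures it raises exactly this ValueError and offers only fp8e4b15 and fp8e5, so this is an environment mismatch rather than a kernel bug. (The empty-looking reprs above, e.g. "self = ", appear to be angle-bracketed object reprs stripped by the log capture.) A minimal capability guard (a sketch only, assuming the test module imports torch; the helper and class names below are hypothetical, not part of activation_test.py) could look like:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    """Best-effort check for Triton fp8e4nv (torch.float8_e4m3fn) support."""
    # Guard the device query so CPU-only runners fall through cleanly.
    if not torch.cuda.is_available():
        return False
    # fp8e4nv lowers only on compute capability 8.9+ (Ada/Hopper-class GPUs);
    # older parts get exactly the ValueError reported above.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(
    not cuda_supports_fp8e4nv(),
    "Triton fp8e4nv requires a GPU with compute capability >= 8.9",
)
class SiluMulQuantFP8Test(unittest.TestCase):
    ...

With a guard like this, the examples above would be reported as skips rather than errors on pre-8.9 GPUs; the same check could instead gate only the FP8 branch inside the test body.]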
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6098438Z 2025-05-07T20:32:16.6098879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6098922Z 2025-05-07T20:32:16.6099026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6099261Z self=, 2025-05-07T20:32:16.6099336Z T=16384, 2025-05-07T20:32:16.6099416Z D=7168, 2025-05-07T20:32:16.6099500Z scale_ub=1200.0, 2025-05-07T20:32:16.6099585Z contiguous=False, 2025-05-07T20:32:16.6099667Z compiled=True, 2025-05-07T20:32:16.6099743Z ) 2025-05-07T20:32:16.6099973Z self = 2025-05-07T20:32:16.6100157Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6100169Z 2025-05-07T20:32:16.6100245Z @given( 2025-05-07T20:32:16.6100364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6100471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6100589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6100706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6100826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6100898Z ) 2025-05-07T20:32:16.6101159Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6101255Z def test_silu_mul_quant( 2025-05-07T20:32:16.6101328Z self, 2025-05-07T20:32:16.6101406Z T: int, 2025-05-07T20:32:16.6101485Z D: int, 2025-05-07T20:32:16.6101584Z scale_ub: Optional[float], 2025-05-07T20:32:16.6101678Z contiguous: bool, 2025-05-07T20:32:16.6101764Z compiled: bool, 2025-05-07T20:32:16.6101843Z ) -> None: 2025-05-07T20:32:16.6101944Z torch.manual_seed(2025) 2025-05-07T20:32:16.6102016Z 2025-05-07T20:32:16.6102185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6102263Z 2025-05-07T20:32:16.6102353Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6102478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6107570Z x = x_sign * x_clamp 2025-05-07T20:32:16.6107675Z x0 = x[:, :D] 2025-05-07T20:32:16.6107757Z x1 = x[:, D:] 2025-05-07T20:32:16.6107827Z 2025-05-07T20:32:16.6108106Z if contiguous: 2025-05-07T20:32:16.6108205Z x0 = x0.contiguous() 2025-05-07T20:32:16.6108291Z x1 = x1.contiguous() 2025-05-07T20:32:16.6108362Z 2025-05-07T20:32:16.6108451Z if scale_ub is not None: 2025-05-07T20:32:16.6108555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6108695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6108778Z ) 2025-05-07T20:32:16.6108853Z else: 2025-05-07T20:32:16.6108950Z scale_ub_tensor = None 2025-05-07T20:32:16.6109022Z 2025-05-07T20:32:16.6109155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6109248Z op = silu_mul_quant 2025-05-07T20:32:16.6109331Z if compiled: 2025-05-07T20:32:16.6109433Z op = torch.compile(op) 2025-05-07T20:32:16.6109539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6109609Z 2025-05-07T20:32:16.6109696Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6109707Z 2025-05-07T20:32:16.6109806Z moe/activation_test.py:117: 2025-05-07T20:32:16.6109937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6110040Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6110138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6110530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6110692Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6111210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6111307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6111739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6111967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6112327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6112421Z kernel = self.compile( 2025-05-07T20:32:16.6112815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6112997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6113137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6113141Z 2025-05-07T20:32:16.6113353Z self = 2025-05-07T20:32:16.6114159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6114687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f963914fb00>} 2025-05-07T20:32:16.6115461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6115659Z context = 2025-05-07T20:32:16.6115665Z 2025-05-07T20:32:16.6115836Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6116112Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6116218Z module_map=module_map) 2025-05-07T20:32:16.6116385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6116485Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6116563Z E ^ 2025-05-07T20:32:16.6117012Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6117018Z 2025-05-07T20:32:16.6117452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6117456Z 2025-05-07T20:32:16.6117559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6117821Z self=, 2025-05-07T20:32:16.6117912Z T=1, 2025-05-07T20:32:16.6117995Z D=7168, 2025-05-07T20:32:16.6118082Z scale_ub=None, 2025-05-07T20:32:16.6118166Z contiguous=False, 2025-05-07T20:32:16.6118248Z compiled=False, 2025-05-07T20:32:16.6118329Z ) 2025-05-07T20:32:16.6118552Z self = 2025-05-07T20:32:16.6118726Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.6118731Z 2025-05-07T20:32:16.6118810Z @given( 2025-05-07T20:32:16.6118930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6119032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6119146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6119260Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6119377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6119493Z ) 2025-05-07T20:32:16.6119745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6119842Z def test_silu_mul_quant( 2025-05-07T20:32:16.6119915Z self, 2025-05-07T20:32:16.6119993Z T: int, 2025-05-07T20:32:16.6120069Z D: int, 2025-05-07T20:32:16.6120210Z scale_ub: Optional[float], 2025-05-07T20:32:16.6120301Z contiguous: bool, 2025-05-07T20:32:16.6120384Z compiled: bool, 2025-05-07T20:32:16.6120461Z ) -> None: 2025-05-07T20:32:16.6120557Z torch.manual_seed(2025) 2025-05-07T20:32:16.6120632Z 2025-05-07T20:32:16.6120803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6120877Z 2025-05-07T20:32:16.6120966Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6121089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6121180Z x = x_sign * x_clamp 2025-05-07T20:32:16.6121259Z x0 = x[:, :D] 2025-05-07T20:32:16.6121340Z x1 = x[:, D:] 2025-05-07T20:32:16.6121414Z 2025-05-07T20:32:16.6121496Z if contiguous: 2025-05-07T20:32:16.6121590Z x0 = x0.contiguous() 2025-05-07T20:32:16.6121675Z x1 = x1.contiguous() 2025-05-07T20:32:16.6121746Z 2025-05-07T20:32:16.6121838Z if scale_ub is not None: 2025-05-07T20:32:16.6121944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6122078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6122154Z ) 2025-05-07T20:32:16.6122233Z else: 2025-05-07T20:32:16.6122327Z scale_ub_tensor = None 2025-05-07T20:32:16.6122403Z 2025-05-07T20:32:16.6122532Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6122619Z op = silu_mul_quant 2025-05-07T20:32:16.6122706Z if compiled: 2025-05-07T20:32:16.6122805Z op = torch.compile(op) 2025-05-07T20:32:16.6122913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6122984Z 2025-05-07T20:32:16.6123074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6123078Z 2025-05-07T20:32:16.6123174Z moe/activation_test.py:117: 2025-05-07T20:32:16.6123304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6123403Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6123505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6124099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6124199Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6124573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6124800Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6125155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6125250Z kernel = self.compile( 2025-05-07T20:32:16.6125645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6125828Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6125961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6125966Z 2025-05-07T20:32:16.6126179Z self = 2025-05-07T20:32:16.6126992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6127513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ce1200e0>} 2025-05-07T20:32:16.6128340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6128536Z context = 2025-05-07T20:32:16.6128578Z 2025-05-07T20:32:16.6128752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6129032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6129139Z module_map=module_map) 2025-05-07T20:32:16.6129308Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6129407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6129486Z E ^ 2025-05-07T20:32:16.6129859Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6129866Z 2025-05-07T20:32:16.6130302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6130307Z 2025-05-07T20:32:16.6130412Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6130646Z self=, 2025-05-07T20:32:16.6130723Z T=2048, 2025-05-07T20:32:16.6130797Z D=7168, 2025-05-07T20:32:16.6130878Z scale_ub=None, 2025-05-07T20:32:16.6130966Z contiguous=False, 2025-05-07T20:32:16.6131052Z compiled=True, 2025-05-07T20:32:16.6131123Z ) 2025-05-07T20:32:16.6131353Z self = 2025-05-07T20:32:16.6131531Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6131535Z 2025-05-07T20:32:16.6131609Z @given( 2025-05-07T20:32:16.6131731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6131916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6132035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6132151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6132264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6132344Z ) 2025-05-07T20:32:16.6132595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6132687Z def test_silu_mul_quant( 2025-05-07T20:32:16.6132765Z self, 2025-05-07T20:32:16.6132925Z T: int, 2025-05-07T20:32:16.6133004Z D: int, 2025-05-07T20:32:16.6133105Z scale_ub: Optional[float], 2025-05-07T20:32:16.6133193Z contiguous: bool, 2025-05-07T20:32:16.6133276Z compiled: bool, 2025-05-07T20:32:16.6133357Z ) -> None: 2025-05-07T20:32:16.6133449Z torch.manual_seed(2025) 2025-05-07T20:32:16.6133524Z 2025-05-07T20:32:16.6133693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6133768Z 2025-05-07T20:32:16.6133861Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6133984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6134070Z x = x_sign * x_clamp 2025-05-07T20:32:16.6134151Z x0 = x[:, :D] 2025-05-07T20:32:16.6134237Z x1 = x[:, D:] 2025-05-07T20:32:16.6134307Z 2025-05-07T20:32:16.6134391Z if contiguous: 2025-05-07T20:32:16.6134480Z x0 = x0.contiguous() 2025-05-07T20:32:16.6134566Z x1 = x1.contiguous() 2025-05-07T20:32:16.6134646Z 2025-05-07T20:32:16.6134735Z if scale_ub is not None: 2025-05-07T20:32:16.6134836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6134976Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6135050Z ) 2025-05-07T20:32:16.6135126Z else: 2025-05-07T20:32:16.6135218Z scale_ub_tensor = None 2025-05-07T20:32:16.6135333Z 2025-05-07T20:32:16.6135463Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6135550Z op = silu_mul_quant 2025-05-07T20:32:16.6135633Z if compiled: 2025-05-07T20:32:16.6135734Z op = torch.compile(op) 2025-05-07T20:32:16.6135839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6135950Z 2025-05-07T20:32:16.6136041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6136045Z 2025-05-07T20:32:16.6136138Z moe/activation_test.py:117: 2025-05-07T20:32:16.6136277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6136375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6136475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6136861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6136953Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6137525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6137627Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6138002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6138239Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6138593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6138690Z kernel = self.compile( 2025-05-07T20:32:16.6139094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6139275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6139405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6139417Z 2025-05-07T20:32:16.6139626Z self = 2025-05-07T20:32:16.6140445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6140984Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a56ac0>} 2025-05-07T20:32:16.6141875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6142073Z context = 2025-05-07T20:32:16.6142078Z 2025-05-07T20:32:16.6142249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6142525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6142634Z module_map=module_map) 2025-05-07T20:32:16.6142797Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6142899Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6142979Z E ^ 2025-05-07T20:32:16.6143352Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6143363Z 2025-05-07T20:32:16.6143803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6143807Z 2025-05-07T20:32:16.6143910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6144142Z self=, 2025-05-07T20:32:16.6144220Z T=4096, 2025-05-07T20:32:16.6144338Z D=7168, 2025-05-07T20:32:16.6144422Z scale_ub=None, 2025-05-07T20:32:16.6144506Z contiguous=False, 2025-05-07T20:32:16.6144589Z compiled=True, 2025-05-07T20:32:16.6144661Z ) 2025-05-07T20:32:16.6144884Z self = 2025-05-07T20:32:16.6145062Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6145107Z 2025-05-07T20:32:16.6145183Z @given( 2025-05-07T20:32:16.6145301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6145403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6145518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6145635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6145751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6145824Z ) 2025-05-07T20:32:16.6146076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6146172Z def test_silu_mul_quant( 2025-05-07T20:32:16.6146247Z self, 2025-05-07T20:32:16.6146322Z T: int, 2025-05-07T20:32:16.6146397Z D: int, 2025-05-07T20:32:16.6146492Z scale_ub: Optional[float], 2025-05-07T20:32:16.6146578Z contiguous: bool, 2025-05-07T20:32:16.6146664Z compiled: bool, 2025-05-07T20:32:16.6146742Z ) -> None: 2025-05-07T20:32:16.6146835Z torch.manual_seed(2025) 2025-05-07T20:32:16.6146906Z 2025-05-07T20:32:16.6147075Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6147154Z 2025-05-07T20:32:16.6147245Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6147368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6147457Z x = x_sign * x_clamp 2025-05-07T20:32:16.6147534Z x0 = x[:, :D] 2025-05-07T20:32:16.6147613Z x1 = x[:, D:] 2025-05-07T20:32:16.6147703Z 2025-05-07T20:32:16.6147790Z if contiguous: 2025-05-07T20:32:16.6147904Z x0 = x0.contiguous() 2025-05-07T20:32:16.6147996Z x1 = x1.contiguous() 2025-05-07T20:32:16.6148068Z 2025-05-07T20:32:16.6148156Z if scale_ub is not None: 2025-05-07T20:32:16.6148261Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6148397Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6148473Z ) 2025-05-07T20:32:16.6148549Z else: 2025-05-07T20:32:16.6148641Z scale_ub_tensor = None 2025-05-07T20:32:16.6148711Z 2025-05-07T20:32:16.6148918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6149008Z op = silu_mul_quant 2025-05-07T20:32:16.6149095Z if compiled: 2025-05-07T20:32:16.6149191Z op = torch.compile(op) 2025-05-07T20:32:16.6149296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6149369Z 2025-05-07T20:32:16.6149456Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6149463Z 2025-05-07T20:32:16.6149558Z moe/activation_test.py:117: 2025-05-07T20:32:16.6149691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6149788Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6149889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6150275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6150370Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6150898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6150996Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6151370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6151605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6152005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6152103Z kernel = self.compile( 2025-05-07T20:32:16.6152504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6152723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6152855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6152860Z 2025-05-07T20:32:16.6153077Z self = 2025-05-07T20:32:16.6153897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6154423Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a55b20>} 2025-05-07T20:32:16.6155215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6155417Z context = 2025-05-07T20:32:16.6155421Z 2025-05-07T20:32:16.6155591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6155875Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6155980Z module_map=module_map) 2025-05-07T20:32:16.6156144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6156248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6156324Z E ^ 2025-05-07T20:32:16.6156699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6156708Z 2025-05-07T20:32:16.6157146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6157150Z 2025-05-07T20:32:16.6157253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6157486Z self=, 2025-05-07T20:32:16.6157562Z T=16384, 2025-05-07T20:32:16.6157635Z D=5120, 2025-05-07T20:32:16.6157794Z scale_ub=1200.0, 2025-05-07T20:32:16.6157879Z contiguous=False, 2025-05-07T20:32:16.6157969Z compiled=False, 2025-05-07T20:32:16.6158038Z ) 2025-05-07T20:32:16.6158262Z self = 2025-05-07T20:32:16.6158449Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.6158453Z 2025-05-07T20:32:16.6158529Z @given( 2025-05-07T20:32:16.6158647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6158749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6158862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6158976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6159094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6159165Z ) 2025-05-07T20:32:16.6159419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6159517Z def test_silu_mul_quant( 2025-05-07T20:32:16.6159590Z self, 2025-05-07T20:32:16.6159666Z T: int, 2025-05-07T20:32:16.6159740Z D: int, 2025-05-07T20:32:16.6159834Z scale_ub: Optional[float], 2025-05-07T20:32:16.6159926Z contiguous: bool, 2025-05-07T20:32:16.6160014Z compiled: bool, 2025-05-07T20:32:16.6160091Z ) -> None: 2025-05-07T20:32:16.6160188Z torch.manual_seed(2025) 2025-05-07T20:32:16.6160302Z 2025-05-07T20:32:16.6160470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6160545Z 2025-05-07T20:32:16.6160634Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6160758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6160843Z x = x_sign * x_clamp 2025-05-07T20:32:16.6160963Z x0 = x[:, :D] 2025-05-07T20:32:16.6161043Z x1 = x[:, D:] 2025-05-07T20:32:16.6161113Z 2025-05-07T20:32:16.6161194Z if contiguous: 2025-05-07T20:32:16.6161291Z x0 = x0.contiguous() 2025-05-07T20:32:16.6161378Z x1 = x1.contiguous() 2025-05-07T20:32:16.6161448Z 2025-05-07T20:32:16.6161538Z if scale_ub is not None: 2025-05-07T20:32:16.6161641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6161775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6161851Z ) 2025-05-07T20:32:16.6161930Z else: 2025-05-07T20:32:16.6162021Z scale_ub_tensor = None 2025-05-07T20:32:16.6162094Z 2025-05-07T20:32:16.6162220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6162311Z op = silu_mul_quant 2025-05-07T20:32:16.6162395Z if compiled: 2025-05-07T20:32:16.6162490Z op = torch.compile(op) 2025-05-07T20:32:16.6162599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6162669Z 2025-05-07T20:32:16.6162755Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6162760Z 2025-05-07T20:32:16.6162862Z moe/activation_test.py:117: 2025-05-07T20:32:16.6162990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6163087Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6163186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6163711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.6163816Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6164190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6164420Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6164784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6164876Z kernel = self.compile( 2025-05-07T20:32:16.6165366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6165550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6165680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6165685Z 2025-05-07T20:32:16.6165897Z self = 2025-05-07T20:32:16.6166714Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6167248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a54c20>} 2025-05-07T20:32:16.6168093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6168291Z context = 2025-05-07T20:32:16.6168295Z 2025-05-07T20:32:16.6168464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6168738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6168889Z module_map=module_map) 2025-05-07T20:32:16.6169055Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6169158Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6169234Z E ^ 2025-05-07T20:32:16.6169605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6169675Z 2025-05-07T20:32:16.6170122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6170126Z 2025-05-07T20:32:16.6170228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6170464Z self=, 2025-05-07T20:32:16.6170539Z T=16384, 2025-05-07T20:32:16.6170613Z D=5120, 2025-05-07T20:32:16.6170695Z scale_ub=1200.0, 2025-05-07T20:32:16.6170778Z contiguous=True, 2025-05-07T20:32:16.6170862Z compiled=True, 2025-05-07T20:32:16.6170934Z ) 2025-05-07T20:32:16.6171157Z self = 2025-05-07T20:32:16.6171336Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6171340Z 2025-05-07T20:32:16.6171421Z @given( 2025-05-07T20:32:16.6171539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6171637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6171809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6171945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6172060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6172133Z ) 2025-05-07T20:32:16.6172385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6172480Z def test_silu_mul_quant( 2025-05-07T20:32:16.6172553Z self, 2025-05-07T20:32:16.6172632Z T: int, 2025-05-07T20:32:16.6172710Z D: int, 2025-05-07T20:32:16.6172806Z scale_ub: Optional[float], 2025-05-07T20:32:16.6172891Z contiguous: bool, 2025-05-07T20:32:16.6172976Z compiled: bool, 2025-05-07T20:32:16.6173052Z ) -> None: 2025-05-07T20:32:16.6173146Z torch.manual_seed(2025) 2025-05-07T20:32:16.6173219Z 2025-05-07T20:32:16.6173387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6173461Z 2025-05-07T20:32:16.6173549Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6173756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6173853Z x = x_sign * x_clamp 2025-05-07T20:32:16.6173935Z x0 = x[:, :D] 2025-05-07T20:32:16.6174011Z x1 = x[:, D:] 2025-05-07T20:32:16.6174085Z 2025-05-07T20:32:16.6174167Z if contiguous: 2025-05-07T20:32:16.6174255Z x0 = x0.contiguous() 2025-05-07T20:32:16.6174346Z x1 = x1.contiguous() 2025-05-07T20:32:16.6174418Z 2025-05-07T20:32:16.6174504Z if scale_ub is not None: 2025-05-07T20:32:16.6174610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6174744Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6174819Z ) 2025-05-07T20:32:16.6174893Z else: 2025-05-07T20:32:16.6174986Z scale_ub_tensor = None 2025-05-07T20:32:16.6175056Z 2025-05-07T20:32:16.6175185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6175272Z op = silu_mul_quant 2025-05-07T20:32:16.6175361Z if compiled: 2025-05-07T20:32:16.6175457Z op = torch.compile(op) 2025-05-07T20:32:16.6175559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6175634Z 2025-05-07T20:32:16.6175721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6175725Z 2025-05-07T20:32:16.6175822Z moe/activation_test.py:117: 2025-05-07T20:32:16.6175950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6176093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6176193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6176578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6176709Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6177230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6177330Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6177707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6177935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6178289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6178386Z kernel = self.compile( 2025-05-07T20:32:16.6178783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6178961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6179092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6179100Z 2025-05-07T20:32:16.6179308Z self = 2025-05-07T20:32:16.6180128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6180650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96385af060>} 2025-05-07T20:32:16.6181445Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6181638Z context = 2025-05-07T20:32:16.6181645Z 2025-05-07T20:32:16.6181812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6182087Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6182265Z module_map=module_map) 2025-05-07T20:32:16.6182429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6182530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6182604Z E ^ 2025-05-07T20:32:16.6182976Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6182983Z 2025-05-07T20:32:16.6183416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6183420Z 2025-05-07T20:32:16.6183523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6183757Z self=, 2025-05-07T20:32:16.6183834Z T=16384, 2025-05-07T20:32:16.6183911Z D=5120, 2025-05-07T20:32:16.6183991Z scale_ub=None, 2025-05-07T20:32:16.6184075Z contiguous=False, 2025-05-07T20:32:16.6184162Z compiled=True, 2025-05-07T20:32:16.6184237Z ) 2025-05-07T20:32:16.6184461Z self = 2025-05-07T20:32:16.6184644Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6184649Z 2025-05-07T20:32:16.6184722Z @given( 2025-05-07T20:32:16.6184839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6184982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6185095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6185213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6185323Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6185394Z ) 2025-05-07T20:32:16.6185650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6185782Z def test_silu_mul_quant( 2025-05-07T20:32:16.6185856Z self, 2025-05-07T20:32:16.6185935Z T: int, 2025-05-07T20:32:16.6186013Z D: int, 2025-05-07T20:32:16.6186109Z scale_ub: Optional[float], 2025-05-07T20:32:16.6186199Z contiguous: bool, 2025-05-07T20:32:16.6186282Z compiled: bool, 2025-05-07T20:32:16.6186358Z ) -> None: 2025-05-07T20:32:16.6186453Z torch.manual_seed(2025) 2025-05-07T20:32:16.6186525Z 2025-05-07T20:32:16.6186693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6186769Z 2025-05-07T20:32:16.6186857Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6186983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6187067Z x = x_sign * x_clamp 2025-05-07T20:32:16.6187144Z x0 = x[:, :D] 2025-05-07T20:32:16.6187226Z x1 = x[:, D:] 2025-05-07T20:32:16.6187297Z 2025-05-07T20:32:16.6187378Z if contiguous: 2025-05-07T20:32:16.6187472Z x0 = x0.contiguous() 2025-05-07T20:32:16.6187559Z x1 = x1.contiguous() 2025-05-07T20:32:16.6187629Z 2025-05-07T20:32:16.6187725Z if scale_ub is not None: 2025-05-07T20:32:16.6187826Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6187960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6188039Z ) 2025-05-07T20:32:16.6188114Z else: 2025-05-07T20:32:16.6188207Z scale_ub_tensor = None 2025-05-07T20:32:16.6188275Z 2025-05-07T20:32:16.6188404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6188495Z op = silu_mul_quant 2025-05-07T20:32:16.6188577Z if compiled: 2025-05-07T20:32:16.6188672Z op = torch.compile(op) 2025-05-07T20:32:16.6188779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6188850Z 2025-05-07T20:32:16.6188939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6188944Z 2025-05-07T20:32:16.6189041Z moe/activation_test.py:117: 2025-05-07T20:32:16.6189169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6189355Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6189454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6189839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6189935Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6190451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6190549Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6190925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6191152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6191509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6191600Z kernel = self.compile( 2025-05-07T20:32:16.6192003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6192185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6192313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6192318Z 2025-05-07T20:32:16.6192529Z self = 2025-05-07T20:32:16.6193385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6193945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9638200b80>} 2025-05-07T20:32:16.6194736Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6194928Z context = 2025-05-07T20:32:16.6194933Z 2025-05-07T20:32:16.6195102Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6195375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6195480Z module_map=module_map) 2025-05-07T20:32:16.6195644Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6195740Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6195817Z E ^ 2025-05-07T20:32:16.6196185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6196190Z 2025-05-07T20:32:16.6196620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6196624Z 2025-05-07T20:32:16.6196726Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6196953Z self=, 2025-05-07T20:32:16.6197031Z T=2048, 2025-05-07T20:32:16.6197109Z D=5120, 2025-05-07T20:32:16.6197192Z scale_ub=None, 2025-05-07T20:32:16.6197278Z contiguous=False, 2025-05-07T20:32:16.6197362Z compiled=True, 2025-05-07T20:32:16.6197432Z ) 2025-05-07T20:32:16.6197657Z self = 2025-05-07T20:32:16.6197833Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6197840Z 2025-05-07T20:32:16.6197913Z @given( 2025-05-07T20:32:16.6198033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6198129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6199067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6199207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6199319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6199395Z ) 2025-05-07T20:32:16.6199648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6199740Z def test_silu_mul_quant( 2025-05-07T20:32:16.6199822Z self, 2025-05-07T20:32:16.6199898Z T: int, 2025-05-07T20:32:16.6199971Z D: int, 2025-05-07T20:32:16.6200071Z scale_ub: Optional[float], 2025-05-07T20:32:16.6200157Z contiguous: bool, 2025-05-07T20:32:16.6200241Z compiled: bool, 2025-05-07T20:32:16.6200318Z ) -> None: 2025-05-07T20:32:16.6200413Z torch.manual_seed(2025) 2025-05-07T20:32:16.6200483Z 2025-05-07T20:32:16.6200656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6200727Z 2025-05-07T20:32:16.6200825Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6200960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6201046Z x = x_sign * x_clamp 2025-05-07T20:32:16.6201127Z x0 = x[:, :D] 2025-05-07T20:32:16.6201203Z x1 = x[:, D:] 2025-05-07T20:32:16.6201271Z 2025-05-07T20:32:16.6201355Z if contiguous: 2025-05-07T20:32:16.6201445Z x0 = x0.contiguous() 2025-05-07T20:32:16.6201604Z x1 = x1.contiguous() 2025-05-07T20:32:16.6201676Z 2025-05-07T20:32:16.6201762Z if scale_ub is not None: 2025-05-07T20:32:16.6201865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6202007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6202122Z ) 2025-05-07T20:32:16.6202196Z else: 2025-05-07T20:32:16.6202291Z scale_ub_tensor = None 2025-05-07T20:32:16.6202360Z 2025-05-07T20:32:16.6202493Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6202588Z op = silu_mul_quant 2025-05-07T20:32:16.6202671Z if compiled: 2025-05-07T20:32:16.6202774Z op = torch.compile(op) 2025-05-07T20:32:16.6202880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6202948Z 2025-05-07T20:32:16.6203040Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6203044Z 2025-05-07T20:32:16.6203140Z moe/activation_test.py:117: 2025-05-07T20:32:16.6203270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6203373Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6203469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6203851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6203947Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6204462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6204560Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6204927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6205151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6205508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6205602Z kernel = self.compile( 2025-05-07T20:32:16.6206002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6206465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6206606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6206611Z 2025-05-07T20:32:16.6206820Z self = 2025-05-07T20:32:16.6207773Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6208296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382020c0>} 2025-05-07T20:32:16.6209077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6209267Z context = 2025-05-07T20:32:16.6209278Z 2025-05-07T20:32:16.6209445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6209725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6209836Z module_map=module_map) 2025-05-07T20:32:16.6209998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6210096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6210172Z E ^ 2025-05-07T20:32:16.6210537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6210602Z 2025-05-07T20:32:16.6211033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6211038Z 2025-05-07T20:32:16.6211138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6211425Z self=, 2025-05-07T20:32:16.6211505Z T=2048, 2025-05-07T20:32:16.6211578Z D=5120, 2025-05-07T20:32:16.6211661Z scale_ub=1200.0, 2025-05-07T20:32:16.6211814Z contiguous=False, 2025-05-07T20:32:16.6211897Z compiled=True, 2025-05-07T20:32:16.6211970Z ) 2025-05-07T20:32:16.6212198Z self = 2025-05-07T20:32:16.6212374Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6212379Z 2025-05-07T20:32:16.6212456Z @given( 2025-05-07T20:32:16.6212572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6212671Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6212788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6212902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6213013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6213093Z ) 2025-05-07T20:32:16.6213344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6213438Z def test_silu_mul_quant( 2025-05-07T20:32:16.6213512Z self, 2025-05-07T20:32:16.6213591Z T: int, 2025-05-07T20:32:16.6213668Z D: int, 2025-05-07T20:32:16.6213764Z scale_ub: Optional[float], 2025-05-07T20:32:16.6213850Z contiguous: bool, 2025-05-07T20:32:16.6213939Z compiled: bool, 2025-05-07T20:32:16.6214014Z ) -> None: 2025-05-07T20:32:16.6214105Z torch.manual_seed(2025) 2025-05-07T20:32:16.6214177Z 2025-05-07T20:32:16.6214346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6214417Z 2025-05-07T20:32:16.6214510Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6214634Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6214723Z x = x_sign * x_clamp 2025-05-07T20:32:16.6214800Z x0 = x[:, :D] 2025-05-07T20:32:16.6214884Z x1 = x[:, D:] 2025-05-07T20:32:16.6214957Z 2025-05-07T20:32:16.6215038Z if contiguous: 2025-05-07T20:32:16.6215128Z x0 = x0.contiguous() 2025-05-07T20:32:16.6215302Z x1 = x1.contiguous() 2025-05-07T20:32:16.6215374Z 2025-05-07T20:32:16.6215462Z if scale_ub is not None: 2025-05-07T20:32:16.6215572Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6215707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6215781Z ) 2025-05-07T20:32:16.6215858Z else: 2025-05-07T20:32:16.6215948Z scale_ub_tensor = None 2025-05-07T20:32:16.6216019Z 2025-05-07T20:32:16.6216153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6216243Z op = silu_mul_quant 2025-05-07T20:32:16.6216330Z if compiled: 2025-05-07T20:32:16.6216425Z op = torch.compile(op) 2025-05-07T20:32:16.6216528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6216603Z 2025-05-07T20:32:16.6216689Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6216694Z 2025-05-07T20:32:16.6216788Z moe/activation_test.py:117: 2025-05-07T20:32:16.6216925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6217023Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6217120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6217502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6217591Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6218153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6218248Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6218614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6218886Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6219242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6219339Z kernel = self.compile( 2025-05-07T20:32:16.6219732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6219910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6220041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6220047Z 2025-05-07T20:32:16.6220255Z self = 2025-05-07T20:32:16.6221062Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6221589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382032e0>} 2025-05-07T20:32:16.6222372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6222567Z context = 2025-05-07T20:32:16.6222571Z 2025-05-07T20:32:16.6222737Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6223013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6223119Z module_map=module_map) 2025-05-07T20:32:16.6223279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6223382Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6223455Z E ^ 2025-05-07T20:32:16.6223819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6223900Z 2025-05-07T20:32:16.6224334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6224339Z 2025-05-07T20:32:16.6224438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6224670Z self=, 2025-05-07T20:32:16.6224746Z T=4096, 2025-05-07T20:32:16.6224819Z D=5120, 2025-05-07T20:32:16.6224904Z scale_ub=1200.0, 2025-05-07T20:32:16.6224985Z contiguous=True, 2025-05-07T20:32:16.6225065Z compiled=True, 2025-05-07T20:32:16.6225140Z ) 2025-05-07T20:32:16.6225362Z self = 2025-05-07T20:32:16.6225542Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6225547Z 2025-05-07T20:32:16.6225620Z @given( 2025-05-07T20:32:16.6225738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6225842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6225955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6226070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6226184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6226256Z ) 2025-05-07T20:32:16.6226505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6226644Z def test_silu_mul_quant( 2025-05-07T20:32:16.6226719Z self, 2025-05-07T20:32:16.6226796Z T: int, 2025-05-07T20:32:16.6226872Z D: int, 2025-05-07T20:32:16.6226966Z scale_ub: Optional[float], 2025-05-07T20:32:16.6227053Z contiguous: bool, 2025-05-07T20:32:16.6227179Z compiled: bool, 2025-05-07T20:32:16.6227255Z ) -> None: 2025-05-07T20:32:16.6227350Z torch.manual_seed(2025) 2025-05-07T20:32:16.6227419Z 2025-05-07T20:32:16.6227592Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6227668Z 2025-05-07T20:32:16.6227757Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6227881Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6227972Z x = x_sign * x_clamp 2025-05-07T20:32:16.6228053Z x0 = x[:, :D] 2025-05-07T20:32:16.6228132Z x1 = x[:, D:] 2025-05-07T20:32:16.6228207Z 2025-05-07T20:32:16.6228293Z if contiguous: 2025-05-07T20:32:16.6228385Z x0 = x0.contiguous() 2025-05-07T20:32:16.6228470Z x1 = x1.contiguous() 2025-05-07T20:32:16.6228541Z 2025-05-07T20:32:16.6228631Z if scale_ub is not None: 2025-05-07T20:32:16.6228733Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6228870Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6233629Z ) 2025-05-07T20:32:16.6233724Z else: 2025-05-07T20:32:16.6233822Z scale_ub_tensor = None 2025-05-07T20:32:16.6233898Z 2025-05-07T20:32:16.6234042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6234137Z op = silu_mul_quant 2025-05-07T20:32:16.6234224Z if compiled: 2025-05-07T20:32:16.6234324Z op = torch.compile(op) 2025-05-07T20:32:16.6234432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6234502Z 2025-05-07T20:32:16.6234593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6234601Z 2025-05-07T20:32:16.6234701Z moe/activation_test.py:117: 2025-05-07T20:32:16.6234833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6234935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6235037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6235425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6235520Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6236196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6236295Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6236666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6236891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6237254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6237362Z kernel = self.compile( 2025-05-07T20:32:16.6237781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6237964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6238095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6238100Z 2025-05-07T20:32:16.6238316Z self = 2025-05-07T20:32:16.6239124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6239642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7e7c860>} 2025-05-07T20:32:16.6240463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6240694Z context = 2025-05-07T20:32:16.6240699Z 2025-05-07T20:32:16.6240867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6241143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6241251Z module_map=module_map) 2025-05-07T20:32:16.6241416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6241512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6241591Z E ^ 2025-05-07T20:32:16.6241960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6241970Z 2025-05-07T20:32:16.6242395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6242400Z 2025-05-07T20:32:16.6242503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6242735Z self=, 2025-05-07T20:32:16.6242809Z T=128, 2025-05-07T20:32:16.6242885Z D=5120, 2025-05-07T20:32:16.6242973Z scale_ub=1200.0, 2025-05-07T20:32:16.6243056Z contiguous=False, 2025-05-07T20:32:16.6243136Z compiled=True, 2025-05-07T20:32:16.6243209Z ) 2025-05-07T20:32:16.6243433Z self = 2025-05-07T20:32:16.6243611Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6243615Z 2025-05-07T20:32:16.6243692Z @given( 2025-05-07T20:32:16.6243808Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6243908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6244021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6244135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6244254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6244325Z ) 2025-05-07T20:32:16.6244574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6244751Z def test_silu_mul_quant( 2025-05-07T20:32:16.6244828Z self, 2025-05-07T20:32:16.6244906Z T: int, 2025-05-07T20:32:16.6244981Z D: int, 2025-05-07T20:32:16.6245078Z scale_ub: Optional[float], 2025-05-07T20:32:16.6245174Z contiguous: bool, 2025-05-07T20:32:16.6245258Z compiled: bool, 2025-05-07T20:32:16.6245335Z ) -> None: 2025-05-07T20:32:16.6245435Z torch.manual_seed(2025) 2025-05-07T20:32:16.6245507Z 2025-05-07T20:32:16.6245678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6245754Z 2025-05-07T20:32:16.6245844Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6245967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6246061Z x = x_sign * x_clamp 2025-05-07T20:32:16.6246140Z x0 = x[:, :D] 2025-05-07T20:32:16.6246219Z x1 = x[:, D:] 2025-05-07T20:32:16.6246289Z 2025-05-07T20:32:16.6246370Z if contiguous: 2025-05-07T20:32:16.6246468Z x0 = x0.contiguous() 2025-05-07T20:32:16.6246554Z x1 = x1.contiguous() 2025-05-07T20:32:16.6246623Z 2025-05-07T20:32:16.6246713Z if scale_ub is not None: 2025-05-07T20:32:16.6246816Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6246955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6247033Z ) 2025-05-07T20:32:16.6247172Z else: 2025-05-07T20:32:16.6247274Z scale_ub_tensor = None 2025-05-07T20:32:16.6247364Z 2025-05-07T20:32:16.6247501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6247589Z op = silu_mul_quant 2025-05-07T20:32:16.6247673Z if compiled: 2025-05-07T20:32:16.6247769Z op = torch.compile(op) 2025-05-07T20:32:16.6247919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6247989Z 2025-05-07T20:32:16.6248079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6248083Z 2025-05-07T20:32:16.6248188Z moe/activation_test.py:117: 2025-05-07T20:32:16.6248317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6248416Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6248524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6248902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6249001Z return fn(*args, **kwargs) 
The identical test source and CompilationError traceback repeat for each of the following examples; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
-> CompilationError (same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
-> CompilationError (same fp8e4nv ValueError)
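Any one of these draws can be replayed without the property-based sweep by pinning it with hypothesis's @example decorator on the existing test. A sketch of the decorator stack only, with the body left as the unchanged test method shown above:

from hypothesis import example, given, settings, strategies as st

# @example runs this exact parameterization first, deterministically,
# before hypothesis draws new examples.
@example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(deadline=None)
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...  # unchanged test body (method of the existing test class)

The remaining examples from the sweep: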
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> CompilationError (same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> CompilationError (same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
-> CompilationError (same fp8e4nv ValueError)
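The ValueError itself is architectural: Triton only lowers fp8e4nv (float8_e4m3fn) conversions on compute capability 8.9 or newer, while the A10G in a g5.4xlarge reports capability 8.6, where only fp8e4b15 and fp8e5 are available. A capability guard along the following lines would skip these cases on pre-8.9 runners; the helper name and the class name ActivationTests are assumptions for illustration, not the existing test code.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) conversions require SM 8.9+ (Ada/Hopper);
    # the A10G on g5 runners reports (8, 6) and would take the skip path.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv (e4m3) support")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant and the other fp8 tests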
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (test source as above)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB; 28.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:92, in torch.randn (tried to allocate 448.00 MiB; 140.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB; 28.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:94, in torch.sign (tried to allocate 56.00 MiB; 28.44 MiB free)
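The OutOfMemoryError cases look like a knock-on effect of the sweep rather than an independent bug: each example allocates [T, 2*D] bfloat16 tensors (for T=16384, D=7168 that is exactly the 448 MiB the allocator reports), and by this point the A10G's 22.07 GiB is nearly full. The error text already suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; releasing cached blocks between examples is the other common mitigation. A sketch, assuming the cleanup helper (release_cuda_memory, an illustrative name) is wired into the test's setUp/tearDown:

import gc
import os
import torch

# Must be set before the first CUDA allocation in the process, e.g. at the
# top of the test module or in the CI job environment, as the OOM message
# suggests, to reduce fragmentation-driven failures.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def release_cuda_memory() -> None:
    # Drop dangling Python references, then return cached blocks to the driver.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()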
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6370808Z 2025-05-07T20:32:16.6370931Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.6370936Z 2025-05-07T20:32:16.6371074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6371304Z self=, 2025-05-07T20:32:16.6371379Z T=1, 2025-05-07T20:32:16.6371457Z D=7168, 2025-05-07T20:32:16.6371544Z scale_ub=1200.0, 2025-05-07T20:32:16.6371626Z contiguous=True, 2025-05-07T20:32:16.6371835Z compiled=False, 2025-05-07T20:32:16.6371914Z ) 2025-05-07T20:32:16.6372140Z self = 2025-05-07T20:32:16.6372317Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.6372322Z 2025-05-07T20:32:16.6372401Z @given( 2025-05-07T20:32:16.6372519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6372619Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6372736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6372855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6372974Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6373045Z ) 2025-05-07T20:32:16.6373297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6373396Z def test_silu_mul_quant( 2025-05-07T20:32:16.6373473Z self, 2025-05-07T20:32:16.6373548Z T: int, 2025-05-07T20:32:16.6373630Z D: int, 2025-05-07T20:32:16.6373726Z scale_ub: Optional[float], 2025-05-07T20:32:16.6373814Z contiguous: bool, 2025-05-07T20:32:16.6373906Z compiled: bool, 2025-05-07T20:32:16.6373986Z ) -> None: 2025-05-07T20:32:16.6374080Z torch.manual_seed(2025) 2025-05-07T20:32:16.6374157Z 2025-05-07T20:32:16.6374330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6374408Z 2025-05-07T20:32:16.6374499Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6374624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6374719Z x = x_sign * x_clamp 2025-05-07T20:32:16.6374798Z x0 = x[:, :D] 2025-05-07T20:32:16.6374875Z x1 = x[:, D:] 2025-05-07T20:32:16.6374952Z 2025-05-07T20:32:16.6375034Z if contiguous: 2025-05-07T20:32:16.6375124Z x0 = x0.contiguous() 2025-05-07T20:32:16.6375216Z x1 = x1.contiguous() 2025-05-07T20:32:16.6375290Z 2025-05-07T20:32:16.6375380Z if scale_ub is not None: 2025-05-07T20:32:16.6375489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6375625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6375750Z ) 2025-05-07T20:32:16.6375826Z else: 2025-05-07T20:32:16.6375918Z scale_ub_tensor = None 2025-05-07T20:32:16.6375991Z 2025-05-07T20:32:16.6376124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6376212Z op = silu_mul_quant 2025-05-07T20:32:16.6376300Z if compiled: 2025-05-07T20:32:16.6376402Z op = torch.compile(op) 2025-05-07T20:32:16.6376508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6376582Z 2025-05-07T20:32:16.6376671Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6376675Z 2025-05-07T20:32:16.6376770Z moe/activation_test.py:117: 2025-05-07T20:32:16.6376907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6377008Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6377113Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6377636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6377735Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7b3ab60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = , T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    [the Triton traceback is byte-for-byte identical to the one above; it is elided for the repeated failures below]
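Note on the repeated CompilationError: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner appears to report SM 8.6, so the kernel fails at compile time before any example can run. A minimal, hypothetical guard along the following lines (not what moe/activation_test.py currently does) would skip these examples on unsupported hardware instead of failing the job:

    # Hypothetical guard, assuming torch is importable on the runner; this is
    # a sketch, not part of the FBGEMM test suite.
    import unittest

    import torch


    def has_fp8e4nv_support() -> bool:
        """True when Triton can lower tl.float8e4nv on the current GPU."""
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)  # SM 8.9+: Ada and Hopper


    class Fp8ActivationTest(unittest.TestCase):  # hypothetical test class
        def setUp(self) -> None:
            if not has_fp8e4nv_support():
                self.skipTest("fp8e4nv needs SM 8.9+; this GPU predates it")

With such a guard the run would report skips on pre-Ada runners rather than one CompilationError per Hypothesis example.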
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
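The allocator hint in the message above (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) targets fragmentation, i.e. large reserved-but-unallocated memory; here only 26.44 MiB of 22.07 GiB is free, so the device is genuinely full and the hint alone would likely not rescue these examples. For reference, a sketch of how the hint is applied; the variable is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation (exporting it in the workflow environment works too):

    # Sketch of applying the allocator hint from the error message above.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var on purpose

    x = torch.randn(1024, device="cuda")  # allocator now uses expandable segments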
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported)

The next eleven examples all fail during test setup with torch.OutOfMemoryError; the message matches the one above except for the requested size, the failing line, and the allocator statistics:

Trying example (T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x); 40.00 MiB requested)
Trying example (T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 320.00 MiB requested)
Trying example (T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 80.00 MiB requested)
Trying example (T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 40.00 MiB requested)
Trying example (T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 112.00 MiB requested)
Trying example (T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 40.00 MiB requested)
Trying example (T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 112.00 MiB requested)
Trying example (T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 448.00 MiB requested)
Trying example (T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 112.00 MiB requested)
Trying example (T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 448.00 MiB requested)
Trying example (T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 448.00 MiB requested)
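Across these examples the free memory shrinks from 26.44 MiB toward 4.44 MiB even though each individual request is small (20-448 MiB), which suggests tensors from earlier failed examples stay alive and every new example inherits a nearly full device. A hypothetical per-example cleanup (not present in the test as shown) could keep one OOM from cascading through the rest of the run:

    # Hypothetical cleanup between Hypothesis examples; a sketch only.
    import gc

    import torch


    def release_cuda_memory() -> None:
        gc.collect()              # drop the previous example's x/x0/x1 references
        torch.cuda.synchronize()  # let in-flight kernels finish before freeing
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver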
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 56.00 MiB requested)
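For context on what the failing op computes: judging only from the test above (two D-wide halves x0 and x1, an optional scale_ub tensor, and a (y_fp8, y_scale) result), silu_mul_quant fuses a SiLU-gated multiply with FP8 quantization. An unfused PyTorch sketch of that contract, as read from the test rather than from FBGEMM's kernel (the real kernel's scaling granularity and numerics may differ):

    # Reference sketch of the assumed silu_mul_quant contract; not FBGEMM code.
    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the scale
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale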
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
  -> same CompilationError (fp8e4nv not supported): the torch.compile path goes one frame deeper through torch/_dynamo/eval_frame.py but reaches the same Triton kernel.

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; 20.00 MiB requested, 4.44 MiB free)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 20.00 MiB requested, 4.44 MiB free)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:32:16.6542669Z 2025-05-07T20:32:16.6542885Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:16.6543056Z ================= 1 failed, 1 deselected, 3 warnings in 14.99s ================= 2025-05-07T20:32:18.2737500Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:18.3378220Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:18.3378484Z 2025-05-07T20:32:20.3397261Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:22.4834593Z ============================= test session starts ============================== 2025-05-07T20:32:22.4835854Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:22.4836911Z cachedir: .pytest_cache 2025-05-07T20:32:22.4838080Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:22.4839430Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:22.4839846Z plugins: hypothesis-6.131.14 2025-05-07T20:32:24.0927425Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:24.2000446Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:24.2001554Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:24.2002120Z 2025-05-07T20:32:26.5492578Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5493939Z self=, 2025-05-07T20:32:26.5494793Z T=1, 2025-05-07T20:32:26.5495165Z D=5120, 2025-05-07T20:32:26.5495535Z scale_ub=None, 2025-05-07T20:32:26.5495946Z contiguous=True, 2025-05-07T20:32:26.5496382Z compiled=True, 2025-05-07T20:32:26.5496777Z ) 2025-05-07T20:32:26.5497423Z self = 2025-05-07T20:32:26.5498424Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:26.5498955Z 2025-05-07T20:32:26.5499110Z @given( 2025-05-07T20:32:26.5499970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.5500294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.5500618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.5500952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.5501291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.5501587Z ) 2025-05-07T20:32:26.5501943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.5502396Z def test_silu_mul_quant( 2025-05-07T20:32:26.5502647Z self, 2025-05-07T20:32:26.5502838Z T: int, 2025-05-07T20:32:26.5503042Z D: int, 2025-05-07T20:32:26.5503265Z scale_ub: Optional[float], 2025-05-07T20:32:26.5503540Z contiguous: bool, 2025-05-07T20:32:26.5503787Z compiled: bool, 2025-05-07T20:32:26.5504023Z ) -> None: 2025-05-07T20:32:26.5504239Z torch.manual_seed(2025) 2025-05-07T20:32:26.5504487Z 2025-05-07T20:32:26.5504775Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.5505119Z 2025-05-07T20:32:26.5505419Z x_sign = torch.sign(x) 2025-05-07T20:32:26.5505728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:26.5506049Z x = x_sign * x_clamp 2025-05-07T20:32:26.5506551Z x0 = x[:, :D] 2025-05-07T20:32:26.5506872Z x1 = x[:, D:] 2025-05-07T20:32:26.5507086Z 2025-05-07T20:32:26.5507270Z if contiguous: 2025-05-07T20:32:26.5507508Z x0 = x0.contiguous() 2025-05-07T20:32:26.5507773Z x1 = x1.contiguous() 2025-05-07T20:32:26.5508010Z 2025-05-07T20:32:26.5508206Z if scale_ub is not None: 2025-05-07T20:32:26.5508579Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.5508915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.5509230Z ) 2025-05-07T20:32:26.5509426Z else: 2025-05-07T20:32:26.5509637Z scale_ub_tensor = None 2025-05-07T20:32:26.5509901Z 2025-05-07T20:32:26.5510181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5510504Z op = silu_mul_quant 2025-05-07T20:32:26.5510764Z if compiled: 2025-05-07T20:32:26.5511019Z op = torch.compile(op) 2025-05-07T20:32:26.5511322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.5511600Z 2025-05-07T20:32:26.5511799Z y_fp8, y_scale = fn() 2025-05-07T20:32:26.5512092Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:26.5512383Z 2025-05-07T20:32:26.5512627Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5512973Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:26.5513270Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:26.5513596Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:26.5513966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5514278Z 2025-05-07T20:32:26.5514483Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:26.5514687Z 2025-05-07T20:32:26.5514792Z moe/activation_test.py:126: 2025-05-07T20:32:26.5515099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5515440Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:26.5515779Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5516761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:26.5517542Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:26.5518106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.5518814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.5519613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:26.5520363Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:26.5521164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:26.5521823Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:26.5522446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:26.5522973Z fn() 2025-05-07T20:32:26.5523493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:26.5524095Z self.fn.run( 2025-05-07T20:32:26.5524569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.5525115Z kernel = self.compile( 2025-05-07T20:32:26.5525675Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.5526424Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.5526829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5527071Z 2025-05-07T20:32:26.5527286Z self = 2025-05-07T20:32:26.5528457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.5529948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef86dc60>} 2025-05-07T20:32:26.5531337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.5532471Z context = 2025-05-07T20:32:26.5532779Z 2025-05-07T20:32:26.5532951Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.5533493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.5533970Z module_map=module_map) 2025-05-07T20:32:26.5534347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.5534712Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:26.5534982Z E ^ 2025-05-07T20:32:26.5535463Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5535937Z 2025-05-07T20:32:26.5536370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.5536901Z 2025-05-07T20:32:26.5537016Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5537436Z self=, 2025-05-07T20:32:26.5537849Z T=2048, 2025-05-07T20:32:26.5538040Z D=5120, 2025-05-07T20:32:26.5538233Z scale_ub=1200.0, 2025-05-07T20:32:26.5538458Z contiguous=True, 2025-05-07T20:32:26.5538684Z compiled=False, 2025-05-07T20:32:26.5538887Z ) 2025-05-07T20:32:27.2863286Z self = 2025-05-07T20:32:27.2863882Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:27.2864200Z 2025-05-07T20:32:27.2864282Z @given( 2025-05-07T20:32:27.2864527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2864844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2865448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2865795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2866136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2866438Z ) 2025-05-07T20:32:27.2866806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2867271Z def test_silu_mul_quant( 2025-05-07T20:32:27.2867529Z self, 2025-05-07T20:32:27.2867736Z T: int, 2025-05-07T20:32:27.2867945Z D: int, 2025-05-07T20:32:27.2868170Z scale_ub: Optional[float], 2025-05-07T20:32:27.2868461Z contiguous: bool, 2025-05-07T20:32:27.2868715Z compiled: bool, 2025-05-07T20:32:27.2868950Z ) -> None: 2025-05-07T20:32:27.2869181Z torch.manual_seed(2025) 2025-05-07T20:32:27.2869434Z 2025-05-07T20:32:27.2869713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2870069Z 2025-05-07T20:32:27.2870280Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2870577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2870898Z x = x_sign * x_clamp 2025-05-07T20:32:27.2871250Z x0 = x[:, :D] 
2025-05-07T20:32:27.2871473Z x1 = x[:, D:] 2025-05-07T20:32:27.2871691Z 2025-05-07T20:32:27.2871888Z if contiguous: 2025-05-07T20:32:27.2872133Z x0 = x0.contiguous() 2025-05-07T20:32:27.2872473Z x1 = x1.contiguous() 2025-05-07T20:32:27.2872722Z 2025-05-07T20:32:27.2872923Z if scale_ub is not None: 2025-05-07T20:32:27.2873201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.2873548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.2873989Z ) 2025-05-07T20:32:27.2874192Z else: 2025-05-07T20:32:27.2874415Z scale_ub_tensor = None 2025-05-07T20:32:27.2874683Z 2025-05-07T20:32:27.2874924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.2875253Z op = silu_mul_quant 2025-05-07T20:32:27.2875517Z if compiled: 2025-05-07T20:32:27.2875768Z op = torch.compile(op) 2025-05-07T20:32:27.2876081Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.2876370Z 2025-05-07T20:32:27.2876567Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.2876748Z 2025-05-07T20:32:27.2876858Z moe/activation_test.py:117: 2025-05-07T20:32:27.2877168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2877513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.2877803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.2878529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.2879253Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.2879812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.2880528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.2881225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.2881788Z kernel = self.compile( 2025-05-07T20:32:27.2882351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.2883039Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.2883458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2883695Z 2025-05-07T20:32:27.2883917Z self = 2025-05-07T20:32:27.2885092Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.2886556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef6c8220>} 2025-05-07T20:32:27.2887976Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.2889058Z context = 2025-05-07T20:32:27.2889360Z 2025-05-07T20:32:27.2889540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.2890083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.2890625Z module_map=module_map) 2025-05-07T20:32:27.2891010Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.2899355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.2899637Z E ^ 2025-05-07T20:32:27.2900204Z E ValueError("type fp8e4nv not supported in this architecture. 
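This is the second, dominant failure mode: Triton lowers its fp8e4nv type (FP8 e4m3) to native hardware FP8 only on newer GPUs (compute capability 8.9 Ada / 9.0 Hopper). The GPU on this runner reports a lower capability, so only fp8e4b15 and fp8e5 are accepted, and both the kernel under test (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) fail at compile time. One way to surface this as a skip rather than a hard failure is a capability gate; a sketch under the assumption that torch's reported capability is authoritative (the helper name is illustrative):

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to hardware FP8 (e4m3), which first appears
        # on SM 8.9 (Ada) / SM 9.0 (Hopper). Older parts only expose the
        # fp8e4b15 / fp8e5 variants named in the error above.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )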
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.2900674Z 2025-05-07T20:32:27.2901117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.2901690Z 2025-05-07T20:32:27.2901808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2902234Z self=, 2025-05-07T20:32:27.2902657Z T=2048, 2025-05-07T20:32:27.2902860Z D=5120, 2025-05-07T20:32:27.2903104Z scale_ub=1200.0, 2025-05-07T20:32:27.2903340Z contiguous=True, 2025-05-07T20:32:27.2903576Z compiled=True, 2025-05-07T20:32:27.2903791Z ) 2025-05-07T20:32:27.2904130Z self = 2025-05-07T20:32:27.2904652Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.2904937Z 2025-05-07T20:32:27.2905030Z @given( 2025-05-07T20:32:27.2905276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2905612Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2905933Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2906615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2906962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2907258Z ) 2025-05-07T20:32:27.2907623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2908077Z def test_silu_mul_quant( 2025-05-07T20:32:27.2908337Z self, 2025-05-07T20:32:27.2908544Z T: int, 2025-05-07T20:32:27.2908746Z D: int, 2025-05-07T20:32:27.2908977Z scale_ub: Optional[float], 2025-05-07T20:32:27.2909264Z contiguous: bool, 2025-05-07T20:32:27.2909510Z compiled: bool, 2025-05-07T20:32:27.2909746Z ) -> None: 2025-05-07T20:32:27.2909975Z torch.manual_seed(2025) 2025-05-07T20:32:27.2910249Z 2025-05-07T20:32:27.2910558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2910914Z 2025-05-07T20:32:27.2911112Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2911419Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2911746Z x = x_sign * x_clamp 2025-05-07T20:32:27.2912002Z x0 = x[:, :D] 2025-05-07T20:32:27.2912226Z x1 = x[:, D:] 2025-05-07T20:32:27.2912447Z 2025-05-07T20:32:27.2912643Z if contiguous: 2025-05-07T20:32:27.2912882Z x0 = x0.contiguous() 2025-05-07T20:32:27.2913154Z x1 = x1.contiguous() 2025-05-07T20:32:27.2913405Z 2025-05-07T20:32:27.2913599Z if scale_ub is not None: 2025-05-07T20:32:27.2913903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.2914341Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.2914664Z ) 2025-05-07T20:32:27.2914860Z else: 2025-05-07T20:32:27.2915082Z scale_ub_tensor = None 2025-05-07T20:32:27.2915346Z 2025-05-07T20:32:27.2915584Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.2915913Z op = silu_mul_quant 2025-05-07T20:32:27.2916181Z if compiled: 2025-05-07T20:32:27.2916435Z op = torch.compile(op) 2025-05-07T20:32:27.2916741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.2917029Z 2025-05-07T20:32:27.2917226Z y_fp8, y_scale = fn() 2025-05-07T20:32:27.2917525Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:27.2917832Z 2025-05-07T20:32:27.2918079Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.2918422Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:27.2918732Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:27.2919063Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:27.2919504Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.2919836Z 2025-05-07T20:32:27.2920051Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:27.2920277Z 2025-05-07T20:32:27.2920395Z moe/activation_test.py:126: 2025-05-07T20:32:27.2920777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2921131Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:27.2921476Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.2922297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:27.2923149Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:27.2923728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.2924434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.2925155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:27.2925905Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:27.2926672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:27.2927332Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:27.2927961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:27.2928508Z fn() 2025-05-07T20:32:27.2929036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:27.2929639Z self.fn.run( 2025-05-07T20:32:27.2930134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.2930697Z kernel = self.compile( 2025-05-07T20:32:27.2931268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.2932052Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.2932473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2932712Z 2025-05-07T20:32:27.2932936Z self = 2025-05-07T20:32:27.2934081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.2935589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef6c96c0>} 2025-05-07T20:32:27.2937024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.2938117Z context = 2025-05-07T20:32:27.2938424Z 2025-05-07T20:32:27.2938605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.2939149Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.2939643Z module_map=module_map) 2025-05-07T20:32:27.2940025Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.2940394Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:27.2940673Z E ^ 2025-05-07T20:32:27.2941165Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.2941642Z 2025-05-07T20:32:27.2942138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.2942685Z 2025-05-07T20:32:27.2942792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2943266Z self=, 2025-05-07T20:32:27.2943691Z T=16384, 2025-05-07T20:32:27.2943886Z D=7168, 2025-05-07T20:32:27.2944089Z scale_ub=1200.0, 2025-05-07T20:32:27.2944324Z contiguous=False, 2025-05-07T20:32:27.2944554Z compiled=False, 2025-05-07T20:32:27.2944812Z ) 2025-05-07T20:32:28.0224034Z self = 2025-05-07T20:32:28.0224641Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.0224942Z 2025-05-07T20:32:28.0225051Z @given( 2025-05-07T20:32:28.0225291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0225604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0225923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0226258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0226598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0226889Z ) 2025-05-07T20:32:28.0227248Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0227707Z def test_silu_mul_quant( 2025-05-07T20:32:28.0227954Z self, 2025-05-07T20:32:28.0228155Z T: int, 2025-05-07T20:32:28.0228361Z D: int, 2025-05-07T20:32:28.0228585Z scale_ub: Optional[float], 2025-05-07T20:32:28.0228866Z contiguous: bool, 2025-05-07T20:32:28.0229117Z compiled: bool, 2025-05-07T20:32:28.0229347Z ) -> None: 2025-05-07T20:32:28.0229572Z torch.manual_seed(2025) 2025-05-07T20:32:28.0229820Z 2025-05-07T20:32:28.0230095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0230450Z 2025-05-07T20:32:28.0230657Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0230950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0231273Z x = x_sign * x_clamp 2025-05-07T20:32:28.0231527Z x0 = x[:, :D] 2025-05-07T20:32:28.0231750Z x1 = x[:, D:] 2025-05-07T20:32:28.0231960Z 2025-05-07T20:32:28.0232153Z if contiguous: 2025-05-07T20:32:28.0232397Z x0 = x0.contiguous() 2025-05-07T20:32:28.0232660Z x1 = x1.contiguous() 2025-05-07T20:32:28.0232908Z 2025-05-07T20:32:28.0233109Z if scale_ub is not None: 2025-05-07T20:32:28.0233387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0233737Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0234059Z ) 2025-05-07T20:32:28.0234512Z else: 2025-05-07T20:32:28.0234737Z scale_ub_tensor = None 2025-05-07T20:32:28.0235000Z 2025-05-07T20:32:28.0235236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0235561Z op = silu_mul_quant 2025-05-07T20:32:28.0235825Z if compiled: 2025-05-07T20:32:28.0236074Z op = torch.compile(op) 2025-05-07T20:32:28.0236386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0236673Z 2025-05-07T20:32:28.0236872Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.0237039Z 2025-05-07T20:32:28.0237143Z moe/activation_test.py:117: 2025-05-07T20:32:28.0237452Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0237801Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.0238084Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0238808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:28.0239527Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.0240190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0240955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0241647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0242277Z kernel = self.compile( 2025-05-07T20:32:28.0242833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0243516Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0244007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0244241Z 2025-05-07T20:32:28.0244461Z self = 2025-05-07T20:32:28.0245592Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0247039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee57ce00>} 2025-05-07T20:32:28.0248442Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0249508Z context = 2025-05-07T20:32:28.0249809Z 2025-05-07T20:32:28.0249986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0250525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0251008Z module_map=module_map) 2025-05-07T20:32:28.0251382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0251744Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.0252074Z E ^ 2025-05-07T20:32:28.0252577Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0253054Z 2025-05-07T20:32:28.0253489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0254024Z 2025-05-07T20:32:28.0254139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0254568Z self=, 2025-05-07T20:32:28.0254987Z T=1, 2025-05-07T20:32:28.0255184Z D=7168, 2025-05-07T20:32:28.0255380Z scale_ub=None, 2025-05-07T20:32:28.0255655Z contiguous=True, 2025-05-07T20:32:28.0255891Z compiled=True, 2025-05-07T20:32:28.0256108Z ) 2025-05-07T20:32:28.0256438Z self = 2025-05-07T20:32:28.0256945Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.0257213Z 2025-05-07T20:32:28.0257301Z @given( 2025-05-07T20:32:28.0257538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0257867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0258185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0258520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0258859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0259157Z ) 2025-05-07T20:32:28.0259520Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0259973Z def test_silu_mul_quant( 2025-05-07T20:32:28.0260231Z self, 2025-05-07T20:32:28.0260440Z T: int, 2025-05-07T20:32:28.0260639Z D: int, 2025-05-07T20:32:28.0260864Z scale_ub: Optional[float], 2025-05-07T20:32:28.0261192Z contiguous: bool, 2025-05-07T20:32:28.0261437Z compiled: bool, 2025-05-07T20:32:28.0261667Z ) -> None: 2025-05-07T20:32:28.0261890Z torch.manual_seed(2025) 2025-05-07T20:32:28.0262133Z 2025-05-07T20:32:28.0262457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0262811Z 2025-05-07T20:32:28.0263008Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0263312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0263631Z x = x_sign * x_clamp 2025-05-07T20:32:28.0263875Z x0 = x[:, :D] 2025-05-07T20:32:28.0264143Z x1 = x[:, D:] 2025-05-07T20:32:28.0264357Z 2025-05-07T20:32:28.0264545Z if contiguous: 2025-05-07T20:32:28.0264790Z x0 = x0.contiguous() 2025-05-07T20:32:28.0265060Z x1 = x1.contiguous() 2025-05-07T20:32:28.0265308Z 2025-05-07T20:32:28.0265506Z if scale_ub is not None: 2025-05-07T20:32:28.0265788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0266133Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0266446Z ) 2025-05-07T20:32:28.0266649Z else: 2025-05-07T20:32:28.0266868Z scale_ub_tensor = None 2025-05-07T20:32:28.0267125Z 2025-05-07T20:32:28.0267366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0267695Z op = silu_mul_quant 2025-05-07T20:32:28.0267950Z if compiled: 2025-05-07T20:32:28.0268212Z op = torch.compile(op) 2025-05-07T20:32:28.0268524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0268805Z 2025-05-07T20:32:28.0269007Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.0269308Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.0269604Z 2025-05-07T20:32:28.0269854Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0270202Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.0270524Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.0270881Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.0271258Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.0271582Z 2025-05-07T20:32:28.0271788Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.0271996Z 2025-05-07T20:32:28.0272098Z moe/activation_test.py:126: 2025-05-07T20:32:28.0272415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0272757Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.0273102Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.0273974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.0274758Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.0275319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0276029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0276749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.0277500Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.0278255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.0278924Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.0279550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.0280082Z fn() 2025-05-07T20:32:28.0280615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.0281272Z self.fn.run( 2025-05-07T20:32:28.0281804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0282354Z kernel = self.compile( 2025-05-07T20:32:28.0282913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0283632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0284041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0284285Z 2025-05-07T20:32:28.0284541Z self = 2025-05-07T20:32:28.0285676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0287111Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee5ab600>} 2025-05-07T20:32:28.0288519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0289587Z context = 2025-05-07T20:32:28.0289894Z 2025-05-07T20:32:28.0290064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0290615Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0291098Z module_map=module_map) 2025-05-07T20:32:28.0291469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0291899Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.0292179Z E ^ 2025-05-07T20:32:28.0292661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0293141Z 2025-05-07T20:32:28.0293574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0294116Z 2025-05-07T20:32:28.0294226Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0294654Z self=, 2025-05-07T20:32:28.0295067Z T=4096, 2025-05-07T20:32:28.0295263Z D=5120, 2025-05-07T20:32:28.0295463Z scale_ub=None, 2025-05-07T20:32:28.0295684Z contiguous=False, 2025-05-07T20:32:28.0295920Z compiled=False, 2025-05-07T20:32:28.0296132Z ) 2025-05-07T20:32:28.8252612Z self = 2025-05-07T20:32:28.8253178Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.8253492Z 2025-05-07T20:32:28.8253575Z @given( 2025-05-07T20:32:28.8253810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8254126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8254430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8254766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8255095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8255386Z ) 2025-05-07T20:32:28.8255749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8256209Z def test_silu_mul_quant( 2025-05-07T20:32:28.8256454Z self, 2025-05-07T20:32:28.8256660Z T: int, 2025-05-07T20:32:28.8256865Z D: int, 2025-05-07T20:32:28.8257089Z scale_ub: Optional[float], 2025-05-07T20:32:28.8257372Z contiguous: bool, 2025-05-07T20:32:28.8257625Z compiled: bool, 2025-05-07T20:32:28.8257854Z ) -> None: 2025-05-07T20:32:28.8258156Z torch.manual_seed(2025) 2025-05-07T20:32:28.8258409Z 2025-05-07T20:32:28.8258692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8259037Z 2025-05-07T20:32:28.8259240Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8259609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8259927Z x = x_sign * x_clamp 2025-05-07T20:32:28.8260175Z x0 = x[:, :D] 2025-05-07T20:32:28.8260398Z x1 = x[:, D:] 2025-05-07T20:32:28.8260606Z 2025-05-07T20:32:28.8260798Z if contiguous: 2025-05-07T20:32:28.8261137Z x0 = x0.contiguous() 2025-05-07T20:32:28.8261399Z x1 = x1.contiguous() 2025-05-07T20:32:28.8261648Z 2025-05-07T20:32:28.8261848Z if scale_ub is not None: 2025-05-07T20:32:28.8262163Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8262513Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8262826Z ) 2025-05-07T20:32:28.8263031Z else: 2025-05-07T20:32:28.8263248Z scale_ub_tensor = None 2025-05-07T20:32:28.8263501Z 2025-05-07T20:32:28.8263742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8264075Z op = silu_mul_quant 2025-05-07T20:32:28.8264333Z if compiled: 2025-05-07T20:32:28.8264591Z op = torch.compile(op) 2025-05-07T20:32:28.8264899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8265177Z 2025-05-07T20:32:28.8265381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.8265561Z 2025-05-07T20:32:28.8265664Z moe/activation_test.py:117: 2025-05-07T20:32:28.8265973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8266311Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.8266607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8267333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.8268048Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.8268609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8269323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8270016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8270569Z kernel = self.compile( 2025-05-07T20:32:28.8271132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8271826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8272282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8272523Z 2025-05-07T20:32:28.8272737Z self = 2025-05-07T20:32:28.8273867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8275300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee5ab420>} 2025-05-07T20:32:28.8276702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8277762Z context = 2025-05-07T20:32:28.8278070Z 2025-05-07T20:32:28.8278240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8278829Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8279316Z module_map=module_map) 2025-05-07T20:32:28.8279684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8280087Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.8280352Z E ^ 2025-05-07T20:32:28.8280826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8281298Z 2025-05-07T20:32:28.8281726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8282309Z 2025-05-07T20:32:28.8282417Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8282852Z self=, 2025-05-07T20:32:28.8283271Z T=4096, 2025-05-07T20:32:28.8283467Z D=7168, 2025-05-07T20:32:28.8291393Z scale_ub=None, 2025-05-07T20:32:28.8291651Z contiguous=False, 2025-05-07T20:32:28.8291978Z compiled=False, 2025-05-07T20:32:28.8292192Z ) 2025-05-07T20:32:28.8292518Z self = 2025-05-07T20:32:28.8293044Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.8293332Z 2025-05-07T20:32:28.8293422Z @given( 2025-05-07T20:32:28.8293662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8293991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8294313Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8294662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8294998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8295297Z ) 2025-05-07T20:32:28.8295666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8296118Z def test_silu_mul_quant( 2025-05-07T20:32:28.8296373Z self, 2025-05-07T20:32:28.8296583Z T: int, 2025-05-07T20:32:28.8296784Z D: int, 2025-05-07T20:32:28.8297012Z scale_ub: Optional[float], 2025-05-07T20:32:28.8297289Z contiguous: bool, 2025-05-07T20:32:28.8297537Z compiled: bool, 2025-05-07T20:32:28.8297756Z ) -> None: 2025-05-07T20:32:28.8297976Z torch.manual_seed(2025) 2025-05-07T20:32:28.8298224Z 2025-05-07T20:32:28.8298497Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8298850Z 2025-05-07T20:32:28.8299051Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8299351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8299668Z x = x_sign * x_clamp 2025-05-07T20:32:28.8299919Z x0 = x[:, :D] 2025-05-07T20:32:28.8300133Z x1 = x[:, D:] 2025-05-07T20:32:28.8300421Z 2025-05-07T20:32:28.8300616Z if contiguous: 2025-05-07T20:32:28.8300850Z x0 = x0.contiguous() 2025-05-07T20:32:28.8301126Z x1 = x1.contiguous() 2025-05-07T20:32:28.8301372Z 2025-05-07T20:32:28.8301562Z if scale_ub is not None: 2025-05-07T20:32:28.8301844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8302187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8302506Z ) 2025-05-07T20:32:28.8302696Z else: 2025-05-07T20:32:28.8302911Z scale_ub_tensor = None 2025-05-07T20:32:28.8303165Z 2025-05-07T20:32:28.8303399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8303722Z op = silu_mul_quant 2025-05-07T20:32:28.8303981Z if compiled: 2025-05-07T20:32:28.8304226Z op = torch.compile(op) 2025-05-07T20:32:28.8304530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8304812Z 2025-05-07T20:32:28.8305005Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.8305180Z 2025-05-07T20:32:28.8305281Z moe/activation_test.py:117: 2025-05-07T20:32:28.8305640Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8305980Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.8306513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8307317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.8308030Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.8308577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8309348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8310037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8310592Z kernel = self.compile( 2025-05-07T20:32:28.8311145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8311823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8312237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8312478Z 2025-05-07T20:32:28.8312694Z self = 2025-05-07T20:32:28.8313818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8315246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee58c360>} 2025-05-07T20:32:28.8316638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8317702Z context = 2025-05-07T20:32:28.8317999Z 2025-05-07T20:32:28.8318172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8318715Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8319198Z module_map=module_map) 2025-05-07T20:32:28.8319566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8319936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.8320205Z E ^ 2025-05-07T20:32:28.8320688Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8321227Z 2025-05-07T20:32:28.8321657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8322193Z 2025-05-07T20:32:28.8322301Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8322730Z self=, 2025-05-07T20:32:28.8323149Z T=128, 2025-05-07T20:32:28.8323338Z D=7168, 2025-05-07T20:32:28.8323540Z scale_ub=None, 2025-05-07T20:32:28.8323761Z contiguous=False, 2025-05-07T20:32:28.8323988Z compiled=True, 2025-05-07T20:32:28.8324197Z ) 2025-05-07T20:32:28.8863434Z self = 2025-05-07T20:32:28.8863987Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.8864272Z 2025-05-07T20:32:28.8864350Z @given( 2025-05-07T20:32:28.8864585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8864909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8865218Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8865555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8865994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8866286Z ) 2025-05-07T20:32:28.8866642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8867160Z def test_silu_mul_quant( 2025-05-07T20:32:28.8867416Z self, 2025-05-07T20:32:28.8867618Z T: int, 2025-05-07T20:32:28.8867828Z D: int, 2025-05-07T20:32:28.8868060Z scale_ub: Optional[float], 2025-05-07T20:32:28.8868332Z contiguous: bool, 2025-05-07T20:32:28.8868578Z compiled: bool, 2025-05-07T20:32:28.8868881Z ) -> None: 2025-05-07T20:32:28.8869095Z torch.manual_seed(2025) 2025-05-07T20:32:28.8869345Z 2025-05-07T20:32:28.8869623Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8869968Z 2025-05-07T20:32:28.8870170Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8870469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8870790Z x = x_sign * x_clamp 2025-05-07T20:32:28.8871034Z x0 = x[:, :D] 2025-05-07T20:32:28.8871257Z x1 = x[:, D:] 2025-05-07T20:32:28.8871470Z 2025-05-07T20:32:28.8871656Z if contiguous: 2025-05-07T20:32:28.8871898Z x0 = x0.contiguous() 2025-05-07T20:32:28.8872172Z x1 = x1.contiguous() 2025-05-07T20:32:28.8872418Z 2025-05-07T20:32:28.8872614Z if scale_ub is not None: 2025-05-07T20:32:28.8872894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8873228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8873545Z ) 2025-05-07T20:32:28.8873741Z else: 2025-05-07T20:32:28.8873948Z scale_ub_tensor = None 2025-05-07T20:32:28.8874285Z 2025-05-07T20:32:28.8874576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8874903Z op = silu_mul_quant 2025-05-07T20:32:28.8875164Z if compiled: 2025-05-07T20:32:28.8875423Z op = torch.compile(op) 2025-05-07T20:32:28.8875717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8876003Z 2025-05-07T20:32:28.8876205Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.8876509Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.8876805Z 2025-05-07T20:32:28.8877048Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8877388Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.8877684Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.8878007Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.8878374Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8878688Z 2025-05-07T20:32:28.8878980Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.8879182Z 2025-05-07T20:32:28.8879289Z moe/activation_test.py:126: 2025-05-07T20:32:28.8879594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8879933Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.8880266Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8881127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.8881905Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.8882464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8883171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8883882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.8884624Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.8885425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.8886084Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.8886704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.8887274Z fn() 2025-05-07T20:32:28.8887797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.8888399Z self.fn.run( 2025-05-07T20:32:28.8888875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8889462Z kernel = self.compile( 2025-05-07T20:32:28.8890020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8890693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8891100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8891339Z 2025-05-07T20:32:28.8891550Z self = 2025-05-07T20:32:28.8892740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8894171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5e7a0>} 2025-05-07T20:32:28.8895557Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8896624Z context = 2025-05-07T20:32:28.8896928Z 2025-05-07T20:32:28.8897099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8897637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8898119Z module_map=module_map) 2025-05-07T20:32:28.8898496Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8898861Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.8899128Z E ^ 2025-05-07T20:32:28.8899609Z E ValueError("type fp8e4nv not supported in this architecture. 
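Note that the reference path fails for the same reason as the kernel under test: ref_fn() calls triton_quantize_fp8_row, which is itself a Triton kernel targeting fp8e4nv. If the reference were meant to be independent of Triton, a torch-only rowwise quantizer could stand in. The sketch below is illustrative only, assumes the installed PyTorch exposes torch.float8_e4m3fn, and is not FBGEMM's implementation:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Rowwise symmetric quantization into float8_e4m3fn. Returns the fp8
        # tensor and the per-row dequantization scale, matching how the test
        # dequantizes: y_fp8.to(torch.float32) * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Assumed semantics: scale_ub bounds the per-row max.
            row_max = row_max.clamp(max=scale_ub)
        scale = row_max / fp8_max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)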
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8900087Z 2025-05-07T20:32:28.8900515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8901092Z 2025-05-07T20:32:28.8901204Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8901630Z self=, 2025-05-07T20:32:28.8902044Z T=128, 2025-05-07T20:32:28.8902237Z D=7168, 2025-05-07T20:32:28.8902430Z scale_ub=None, 2025-05-07T20:32:28.8902651Z contiguous=False, 2025-05-07T20:32:28.8902886Z compiled=False, 2025-05-07T20:32:28.8903088Z ) 2025-05-07T20:32:29.0847392Z self = 2025-05-07T20:32:29.0848505Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.0849076Z 2025-05-07T20:32:29.0849239Z @given( 2025-05-07T20:32:29.0849723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.0850344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.0850849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.0851236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.0851568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.0851922Z ) 2025-05-07T20:32:29.0852394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.0852847Z def test_silu_mul_quant( 2025-05-07T20:32:29.0853101Z self, 2025-05-07T20:32:29.0853304Z T: int, 2025-05-07T20:32:29.0853593Z D: int, 2025-05-07T20:32:29.0853817Z scale_ub: Optional[float], 2025-05-07T20:32:29.0854095Z contiguous: bool, 2025-05-07T20:32:29.0854342Z compiled: bool, 2025-05-07T20:32:29.0854572Z ) -> None: 2025-05-07T20:32:29.0854793Z torch.manual_seed(2025) 2025-05-07T20:32:29.0855041Z 2025-05-07T20:32:29.0855385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.0855735Z 2025-05-07T20:32:29.0855940Z x_sign = torch.sign(x) 2025-05-07T20:32:29.0856238Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.0856558Z x = x_sign * x_clamp 2025-05-07T20:32:29.0856804Z x0 = x[:, :D] 2025-05-07T20:32:29.0857021Z x1 = x[:, D:] 2025-05-07T20:32:29.0857240Z 2025-05-07T20:32:29.0857432Z if contiguous: 2025-05-07T20:32:29.0857667Z x0 = x0.contiguous() 2025-05-07T20:32:29.0857933Z x1 = x1.contiguous() 2025-05-07T20:32:29.0858185Z 2025-05-07T20:32:29.0858376Z if scale_ub is not None: 2025-05-07T20:32:29.0858654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.0858994Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.0859317Z ) 2025-05-07T20:32:29.0859513Z else: 2025-05-07T20:32:29.0859734Z scale_ub_tensor = None 2025-05-07T20:32:29.0859991Z 2025-05-07T20:32:29.0860224Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.0860554Z op = silu_mul_quant 2025-05-07T20:32:29.0860813Z if compiled: 2025-05-07T20:32:29.0861067Z op = torch.compile(op) 2025-05-07T20:32:29.0861375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0861658Z 2025-05-07T20:32:29.0861855Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.0862028Z 2025-05-07T20:32:29.0862135Z moe/activation_test.py:117: 2025-05-07T20:32:29.0862440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0862781Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.0863072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0863786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.0864505Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.0865054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.0865835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.0866524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.0867079Z kernel = self.compile( 2025-05-07T20:32:29.0867632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.0868311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.0868731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0868971Z 2025-05-07T20:32:29.0869182Z self = 2025-05-07T20:32:29.0870308Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.0871932Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5e980>} 2025-05-07T20:32:29.0873381Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.0874486Z context = 2025-05-07T20:32:29.0874784Z 2025-05-07T20:32:29.0874954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.0875493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.0876016Z module_map=module_map) 2025-05-07T20:32:29.0876391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.0876749Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.0877016Z E ^ 2025-05-07T20:32:29.0877494Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.0877962Z 2025-05-07T20:32:29.0878391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.0878930Z 2025-05-07T20:32:29.0879037Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.0879465Z self=, 2025-05-07T20:32:29.0879880Z T=4096, 2025-05-07T20:32:29.0880071Z D=5120, 2025-05-07T20:32:29.0880270Z scale_ub=1200.0, 2025-05-07T20:32:29.0880499Z contiguous=True, 2025-05-07T20:32:29.0880724Z compiled=False, 2025-05-07T20:32:29.0880969Z ) 2025-05-07T20:32:29.0881325Z self = 2025-05-07T20:32:29.0881834Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.0882126Z 2025-05-07T20:32:29.0882205Z @given( 2025-05-07T20:32:29.0882446Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.0882766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.0883079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.0883418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.0883758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.0884046Z ) 2025-05-07T20:32:29.0884409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.0884865Z def test_silu_mul_quant( 2025-05-07T20:32:29.0885111Z self, 2025-05-07T20:32:29.0885317Z T: int, 2025-05-07T20:32:29.0885524Z D: int, 2025-05-07T20:32:29.0885744Z scale_ub: Optional[float], 2025-05-07T20:32:29.0886022Z contiguous: bool, 2025-05-07T20:32:29.0886271Z compiled: bool, 2025-05-07T20:32:29.0886495Z ) -> None: 2025-05-07T20:32:29.0886770Z torch.manual_seed(2025) 2025-05-07T20:32:29.0887022Z 2025-05-07T20:32:29.0887299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.0887652Z 2025-05-07T20:32:29.0887850Z x_sign = torch.sign(x) 2025-05-07T20:32:29.0888144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.0888468Z x = x_sign * x_clamp 2025-05-07T20:32:29.0888716Z x0 = x[:, :D] 2025-05-07T20:32:29.0888944Z x1 = x[:, D:] 2025-05-07T20:32:29.0889152Z 2025-05-07T20:32:29.0889342Z if contiguous: 2025-05-07T20:32:29.0889584Z x0 = x0.contiguous() 2025-05-07T20:32:29.0889846Z x1 = x1.contiguous() 2025-05-07T20:32:29.0890100Z 2025-05-07T20:32:29.0890294Z if scale_ub is not None: 2025-05-07T20:32:29.0890571Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.0890911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.0891230Z ) 2025-05-07T20:32:29.0891422Z else: 2025-05-07T20:32:29.0891636Z scale_ub_tensor = None 2025-05-07T20:32:29.0891958Z 2025-05-07T20:32:29.0892242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.0892564Z op = silu_mul_quant 2025-05-07T20:32:29.0892826Z if compiled: 2025-05-07T20:32:29.0893076Z op = torch.compile(op) 2025-05-07T20:32:29.0893420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0893699Z 2025-05-07T20:32:29.0893896Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.0894064Z 2025-05-07T20:32:29.0894165Z moe/activation_test.py:117: 2025-05-07T20:32:29.0894469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0894852Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.0895135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0895843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.0896555Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.0897108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.0897809Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.0898494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.0899042Z kernel = self.compile( 2025-05-07T20:32:29.0899596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.0900271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.0900679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0900915Z 2025-05-07T20:32:29.0901133Z self = 2025-05-07T20:32:29.0902249Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.0903664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5f9c0>} 2025-05-07T20:32:29.0905072Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.0906344Z context = 2025-05-07T20:32:29.0906681Z 2025-05-07T20:32:29.0906858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.0907489Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.0907974Z module_map=module_map) 2025-05-07T20:32:29.0908348Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.0908710Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.0908979Z E ^ 2025-05-07T20:32:29.0909461Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)

Each of these four examples fails at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, with the same test source and traceback as the _kernel_quantize_fp8_row failure shown above, ending in the identical error:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.7188093Z 2025-05-07T20:32:30.7188576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.7189122Z 2025-05-07T20:32:30.7189230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.7189669Z self=, 2025-05-07T20:32:30.7190097Z T=16384, 2025-05-07T20:32:30.7190298Z D=5120, 2025-05-07T20:32:30.7190505Z scale_ub=None, 2025-05-07T20:32:30.7190751Z contiguous=True, 2025-05-07T20:32:30.7190992Z compiled=True, 2025-05-07T20:32:30.7191207Z ) 2025-05-07T20:32:30.7454999Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:30.7456402Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:30.7457801Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:30.7459040Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:30.7460192Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:30.8327736Z self = 2025-05-07T20:32:30.8328976Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.8329546Z 2025-05-07T20:32:30.8330061Z @given( 2025-05-07T20:32:30.8330526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.8331164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.8331526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.8331925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.8332268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.8332573Z ) 2025-05-07T20:32:30.8332931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.8333394Z def test_silu_mul_quant( 2025-05-07T20:32:30.8333654Z self, 2025-05-07T20:32:30.8333862Z T: int, 2025-05-07T20:32:30.8334075Z D: int, 2025-05-07T20:32:30.8334308Z scale_ub: Optional[float], 2025-05-07T20:32:30.8334599Z contiguous: bool, 2025-05-07T20:32:30.8334847Z compiled: bool, 2025-05-07T20:32:30.8335092Z ) -> None: 2025-05-07T20:32:30.8335321Z torch.manual_seed(2025) 2025-05-07T20:32:30.8335573Z 2025-05-07T20:32:30.8335860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.8336220Z 2025-05-07T20:32:30.8336419Z x_sign = torch.sign(x) 2025-05-07T20:32:30.8336728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.8337054Z x = x_sign * x_clamp 2025-05-07T20:32:30.8337299Z x0 = x[:, :D] 2025-05-07T20:32:30.8337532Z x1 = x[:, D:] 2025-05-07T20:32:30.8337753Z 2025-05-07T20:32:30.8337944Z if contiguous: 2025-05-07T20:32:30.8338193Z x0 = x0.contiguous() 2025-05-07T20:32:30.8338466Z x1 = x1.contiguous() 2025-05-07T20:32:30.8338713Z 2025-05-07T20:32:30.8338917Z if scale_ub is not None: 2025-05-07T20:32:30.8339207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.8339547Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:30.8339870Z ) 2025-05-07T20:32:30.8340079Z else: 2025-05-07T20:32:30.8340303Z scale_ub_tensor = None 2025-05-07T20:32:30.8340558Z 2025-05-07T20:32:30.8340805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8341234Z op = silu_mul_quant 2025-05-07T20:32:30.8341536Z if compiled: 2025-05-07T20:32:30.8341806Z op = torch.compile(op) 2025-05-07T20:32:30.8342123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8342404Z 2025-05-07T20:32:30.8342613Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.8342916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.8343218Z 2025-05-07T20:32:30.8343472Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8343824Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.8344125Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.8344458Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.8344837Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.8345162Z 2025-05-07T20:32:30.8345370Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:30.8345582Z 2025-05-07T20:32:30.8345687Z moe/activation_test.py:126: 2025-05-07T20:32:30.8346004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8346351Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.8346782Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.8347610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.8348471Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.8349034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8349745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8350507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.8351255Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.8352018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.8352693Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.8353322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.8353857Z fn() 2025-05-07T20:32:30.8354394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.8355013Z self.fn.run( 2025-05-07T20:32:30.8355494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8356054Z kernel = self.compile( 2025-05-07T20:32:30.8356618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8357304Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8357712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8357959Z 2025-05-07T20:32:30.8358180Z self = 2025-05-07T20:32:30.8359311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8360762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac87a6d40>} 2025-05-07T20:32:30.8362165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8363274Z context = 2025-05-07T20:32:30.8363581Z 2025-05-07T20:32:30.8363753Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8364299Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8364778Z module_map=module_map) 2025-05-07T20:32:30.8365162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8365540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.8365822Z E ^ 2025-05-07T20:32:30.8366303Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8366779Z 2025-05-07T20:32:30.8367210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8367744Z 2025-05-07T20:32:30.8367862Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8368301Z self=, 2025-05-07T20:32:30.8368718Z T=1, 2025-05-07T20:32:30.8368917Z D=5120, 2025-05-07T20:32:30.8369175Z scale_ub=1200.0, 2025-05-07T20:32:30.8369411Z contiguous=True, 2025-05-07T20:32:30.8369650Z compiled=True, 2025-05-07T20:32:30.8369873Z ) 2025-05-07T20:32:30.9806808Z self = 2025-05-07T20:32:30.9807483Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:30.9807861Z 2025-05-07T20:32:30.9807966Z @given( 2025-05-07T20:32:30.9808199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9808512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9808923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9809256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9809588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9809889Z ) 2025-05-07T20:32:30.9810247Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9810699Z def test_silu_mul_quant( 2025-05-07T20:32:30.9810948Z self, 2025-05-07T20:32:30.9811146Z T: int, 2025-05-07T20:32:30.9811343Z D: int, 2025-05-07T20:32:30.9811590Z scale_ub: Optional[float], 2025-05-07T20:32:30.9811950Z contiguous: bool, 2025-05-07T20:32:30.9812190Z compiled: bool, 2025-05-07T20:32:30.9812422Z ) -> None: 2025-05-07T20:32:30.9812640Z torch.manual_seed(2025) 2025-05-07T20:32:30.9812887Z 2025-05-07T20:32:30.9813160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9813512Z 2025-05-07T20:32:30.9813710Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9814003Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9814317Z x = x_sign * x_clamp 2025-05-07T20:32:30.9814563Z x0 = x[:, :D] 2025-05-07T20:32:30.9814784Z x1 = x[:, D:] 2025-05-07T20:32:30.9814995Z 2025-05-07T20:32:30.9815183Z if contiguous: 2025-05-07T20:32:30.9815419Z x0 = x0.contiguous() 2025-05-07T20:32:30.9815681Z x1 = x1.contiguous() 2025-05-07T20:32:30.9815925Z 2025-05-07T20:32:30.9816113Z if scale_ub is not None: 2025-05-07T20:32:30.9816391Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9816736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9817047Z ) 2025-05-07T20:32:30.9817241Z else: 2025-05-07T20:32:30.9817455Z scale_ub_tensor = 
None 2025-05-07T20:32:30.9817707Z 2025-05-07T20:32:30.9817943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9818269Z op = silu_mul_quant 2025-05-07T20:32:30.9818525Z if compiled: 2025-05-07T20:32:30.9818773Z op = torch.compile(op) 2025-05-07T20:32:30.9819167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9819450Z 2025-05-07T20:32:30.9819645Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.9819817Z 2025-05-07T20:32:30.9819920Z moe/activation_test.py:117: 2025-05-07T20:32:30.9820226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9820561Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.9820847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9821459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.9822047Z return fn(*args, **kwargs) 2025-05-07T20:32:30.9822722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.9823433Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9823987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9824686Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9825436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9825987Z kernel = self.compile( 2025-05-07T20:32:30.9826540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9827256Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9827666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9827902Z 2025-05-07T20:32:30.9828119Z self = 2025-05-07T20:32:30.9829284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9830708Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac82a19e0>} 2025-05-07T20:32:30.9832119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9833184Z context = 2025-05-07T20:32:30.9833482Z 2025-05-07T20:32:30.9833657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9834196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9834675Z module_map=module_map) 2025-05-07T20:32:30.9835044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9835409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9835675Z E ^ 2025-05-07T20:32:30.9836152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)

Fails at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, with the same traceback and CompilationError as the examples above.

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)

Fails at y_fp8, y_scale = fn() (moe/activation_test.py:117), via silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.3613030Z 2025-05-07T20:32:31.3613458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Hypothesis went on drawing examples; each of the following failed at moe/activation_test.py:117 with the identical CompilationError while compiling _fbgemm_silu_mul_quant (test source and traceback are the same as in the block above, so only the drawn parameters are listed):
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
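All of these failures share one root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn and, in this Triton build, is only supported on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G, which is compute capability 8.6; there Triton exposes only fp8e4b15 and fp8e5, exactly what the ValueError lists. A minimal sketch that reproduces the error outside the test suite, assuming a pre-SM-8.9 CUDA GPU (the kernel below is illustrative, not code from fbgemm_gpu):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N: tl.constexpr):
    # The fp8e4nv cast/store is what trips the architecture check in
    # ast_to_ttir when the kernel is lowered on SM < 8.9 devices.
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv))

x = torch.randn(16, device="cuda", dtype=torch.float32)
y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (SM 8.6) this raises triton.compiler.errors.CompilationError
# wrapping: ValueError("type fp8e4nv not supported in this architecture. ...")
_cast_to_fp8e4nv[(1,)](x, y, N=16)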
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
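Both kernels seen failing in this log cast to the same type: the fused _fbgemm_silu_mul_quant and, on the reference path, _kernel_quantize_fp8_row reached through triton_quantize_fp8_row. For orientation, a torch-only sketch of the row-wise quantization the reference path computes; the 448.0 bound is torch.finfo(torch.float8_e4m3fn).max, and the helper is an illustrative approximation, not fbgemm_gpu's implementation:

from typing import Optional, Tuple
import torch

FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def quantize_fp8_row_eager(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clamped from above by scale_ub.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # Choose a per-row scale so each row fits the fp8e4nv range; the test
    # dequantizes with y_fp8.to(torch.float32) * y_scale[:, None].
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale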
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> here the fused kernel call went through and the failure moved to the reference path: ref_fn() at moe/activation_test.py:126 raised the same CompilationError from _kernel_quantize_fp8_row (via triton_quantize_fp8_row in fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.1396464Z 
2025-05-07T20:32:32.1396898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.1397432Z 
Hypothesis then tried ten more examples. Every one of them ran the identical test body, reached the same kernel launch (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, silu_mul_quant -> _fbgemm_silu_mul_quant[grid]), and failed with the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters differ:
2025-05-07T20:32:32.1397540Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:32.2578448Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.2610628Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.2642180Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.4425384Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:32.7536645Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.7569070Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:32.8742449Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.8795008Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:32.8856643Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
The failure is insensitive to T, D, scale_ub, contiguity, and compile mode: with compiled=True the call goes through torch/_dynamo/eval_frame.py before reaching the kernel, with compiled=False it calls the kernel directly, and both paths die in the same Triton compile step.
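For context on what the failing op computes: silu_mul_quant fuses a SwiGLU-style activation (SiLU(x0) * x1) with rowwise FP8 quantization, returning the quantized tensor and its per-row scales. A minimal eager-mode sketch, inferred from the test's inputs and outputs — the rowwise scaling scheme, the helper name silu_mul_quant_ref, and the 448.0 e4m3 max constant are assumptions for illustration, not FBGEMM's actual implementation:

    import torch

    FP8_MAX = 448.0  # assumed: largest finite value of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,  # [T, D] bf16 gate branch (x[:, :D] in the test)
        x1: torch.Tensor,  # [T, D] bf16 linear branch (x[:, D:] in the test)
        scale_ub: torch.Tensor | None = None,  # optional [1] fp32 cap on row max
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then per-row quantization to FP8 e4m3
        # (the "fp8e4nv" type that Triton refused to emit above).
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX  # per-row dequantization scale
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)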
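The error does not depend on FBGEMM at all; any Triton kernel that touches the fp8e4nv type should hit the same architecture check on this runner. A hedged standalone reproducer (assuming the check fires on any conversion to tl.float8e4nv, which matches where the traceback above stops — inside ast_to_ttir):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _probe_fp8e4nv(y_ptr):
        # Converting any value to fp8e4nv is enough to trigger the
        # per-architecture dtype check during IR generation.
        v = tl.full((1,), 0.0, dtype=tl.float32).to(tl.float8e4nv)
        tl.store(y_ptr + tl.arange(0, 1), v.to(tl.float32))

    y = torch.zeros(1, device="cuda")
    # On SM < 8.9 this should raise the same CompilationError as the log above.
    _probe_fp8e4nv[(1,)](y)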
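Root cause: Triton only lowers fp8e4nv (torch.float8_e4m3fn) on GPUs with compute capability (8, 9) or newer (Ada/Hopper); the A10G in a g5.4xlarge reports (8, 6), which is why only 'fp8e4b15' and 'fp8e5' are offered. Rather than erroring through every Hypothesis example, the suite could skip on such hardware — a hypothetical guard, not code from activation_test.py:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv needs SM 8.9+; the A10G on this runner is SM 8.6.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipUnless(cuda_supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
    class Fp8MoEActivationTests(unittest.TestCase):
        ...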
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1281016Z 2025-05-07T20:32:33.1281755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1282671Z 2025-05-07T20:32:33.1282832Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1283602Z self=, 2025-05-07T20:32:33.1284279Z T=1, 2025-05-07T20:32:33.1284576Z D=7168, 2025-05-07T20:32:33.1284899Z scale_ub=None, 2025-05-07T20:32:33.1285260Z contiguous=False, 2025-05-07T20:32:33.1285615Z compiled=False, 2025-05-07T20:32:33.1285936Z ) 2025-05-07T20:32:33.1286409Z self = 2025-05-07T20:32:33.1287187Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:33.1287668Z 2025-05-07T20:32:33.1287792Z @given( 2025-05-07T20:32:33.1288172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.1288713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.1289233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.1289815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.1290378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.1290893Z ) 2025-05-07T20:32:33.1291513Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.1292486Z def test_silu_mul_quant( 2025-05-07T20:32:33.1292897Z self, 2025-05-07T20:32:33.1293308Z T: int, 2025-05-07T20:32:33.1293628Z D: int, 2025-05-07T20:32:33.1293989Z scale_ub: Optional[float], 2025-05-07T20:32:33.1294451Z contiguous: bool, 2025-05-07T20:32:33.1294840Z compiled: bool, 2025-05-07T20:32:33.1295332Z ) -> None: 2025-05-07T20:32:33.1295685Z torch.manual_seed(2025) 2025-05-07T20:32:33.1296085Z 2025-05-07T20:32:33.1296547Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.1297149Z 2025-05-07T20:32:33.1297464Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1298010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1298540Z x = x_sign * x_clamp 2025-05-07T20:32:33.1298902Z x0 = x[:, :D] 2025-05-07T20:32:33.1299228Z x1 = x[:, D:] 2025-05-07T20:32:33.1299553Z 2025-05-07T20:32:33.1299838Z if contiguous: 2025-05-07T20:32:33.1300189Z x0 = x0.contiguous() 2025-05-07T20:32:33.1300605Z x1 = x1.contiguous() 2025-05-07T20:32:33.1300993Z 2025-05-07T20:32:33.1301284Z if scale_ub is not None: 2025-05-07T20:32:33.1301742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1302319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1302845Z ) 2025-05-07T20:32:33.1303160Z else: 2025-05-07T20:32:33.1303506Z scale_ub_tensor = None 2025-05-07T20:32:33.1303924Z 2025-05-07T20:32:33.1304308Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1304856Z op = silu_mul_quant 2025-05-07T20:32:33.1305282Z if compiled: 2025-05-07T20:32:33.1305685Z op = torch.compile(op) 2025-05-07T20:32:33.1306520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1307011Z 2025-05-07T20:32:33.1307320Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.1307596Z 2025-05-07T20:32:33.1307756Z moe/activation_test.py:117: 2025-05-07T20:32:33.1308262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1308837Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.1309322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1310608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.1311913Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.1312889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.1314165Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.1315412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.1316525Z kernel = self.compile( 2025-05-07T20:32:33.1317525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.1318753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.1319459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1319875Z 2025-05-07T20:32:33.1320227Z self = 2025-05-07T20:32:33.1321843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.1323820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917a57100>} 2025-05-07T20:32:33.1325930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.1327459Z context = 2025-05-07T20:32:33.1327878Z 2025-05-07T20:32:33.1328132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.1329056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.1329749Z module_map=module_map) 2025-05-07T20:32:33.1330281Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.1330785Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.1331330Z E ^ 2025-05-07T20:32:33.1332139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1332908Z 2025-05-07T20:32:33.1333615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1334506Z 2025-05-07T20:32:33.1334671Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1335308Z self=, 2025-05-07T20:32:33.1335985Z T=2048, 2025-05-07T20:32:33.1336299Z D=7168, 2025-05-07T20:32:33.1336610Z scale_ub=None, 2025-05-07T20:32:33.1336957Z contiguous=False, 2025-05-07T20:32:33.1337330Z compiled=True, 2025-05-07T20:32:33.1337665Z ) 2025-05-07T20:32:33.2202663Z self = 2025-05-07T20:32:33.2203637Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.2204161Z 2025-05-07T20:32:33.2204299Z @given( 2025-05-07T20:32:33.2204672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.2205222Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.2205736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.2206636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.2207230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.2207720Z ) 2025-05-07T20:32:33.2208325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.2209130Z def test_silu_mul_quant( 2025-05-07T20:32:33.2209543Z self, 2025-05-07T20:32:33.2209856Z T: int, 2025-05-07T20:32:33.2210181Z D: int, 2025-05-07T20:32:33.2210537Z scale_ub: Optional[float], 2025-05-07T20:32:33.2211000Z contiguous: bool, 2025-05-07T20:32:33.2211388Z compiled: bool, 2025-05-07T20:32:33.2211772Z ) -> None: 2025-05-07T20:32:33.2212214Z torch.manual_seed(2025) 2025-05-07T20:32:33.2212619Z 2025-05-07T20:32:33.2213075Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.2213971Z 2025-05-07T20:32:33.2214280Z x_sign = torch.sign(x) 2025-05-07T20:32:33.2214736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.2215231Z x = x_sign * x_clamp 2025-05-07T20:32:33.2215604Z x0 = x[:, :D] 2025-05-07T20:32:33.2215950Z x1 = x[:, D:] 2025-05-07T20:32:33.2216276Z 2025-05-07T20:32:33.2216559Z if contiguous: 2025-05-07T20:32:33.2216930Z x0 = x0.contiguous() 2025-05-07T20:32:33.2217358Z x1 = x1.contiguous() 2025-05-07T20:32:33.2217763Z 2025-05-07T20:32:33.2218080Z if scale_ub is not None: 2025-05-07T20:32:33.2218558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.2219142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.2219663Z ) 2025-05-07T20:32:33.2219972Z else: 2025-05-07T20:32:33.2220313Z scale_ub_tensor = None 2025-05-07T20:32:33.2220732Z 2025-05-07T20:32:33.2221112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.2221635Z op = silu_mul_quant 2025-05-07T20:32:33.2222055Z if compiled: 2025-05-07T20:32:33.2222597Z op = torch.compile(op) 2025-05-07T20:32:33.2223085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2223544Z 2025-05-07T20:32:33.2223856Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.2224243Z 2025-05-07T20:32:33.2224413Z moe/activation_test.py:117: 2025-05-07T20:32:33.2224906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2225483Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.2225960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2226983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.2228124Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.2229351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.2230644Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2231616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2232938Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2234177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2235161Z kernel = self.compile( 2025-05-07T20:32:33.2236125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2237081Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2237636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2237960Z 2025-05-07T20:32:33.2238254Z self = 2025-05-07T20:32:33.2239769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2241824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b44720>} 2025-05-07T20:32:33.2243880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2245452Z context = 2025-05-07T20:32:33.2245881Z 2025-05-07T20:32:33.2246118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2247002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2247709Z module_map=module_map) 2025-05-07T20:32:33.2248274Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2248824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2249234Z E ^ 2025-05-07T20:32:33.2249997Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.2250767Z 2025-05-07T20:32:33.2251467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.2252484Z 2025-05-07T20:32:33.2252642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.2253313Z self=, 2025-05-07T20:32:33.2253963Z T=4096, 2025-05-07T20:32:33.2254244Z D=7168, 2025-05-07T20:32:33.2254536Z scale_ub=None, 2025-05-07T20:32:33.2254871Z contiguous=False, 2025-05-07T20:32:33.2255208Z compiled=True, 2025-05-07T20:32:33.2255525Z ) 2025-05-07T20:32:33.2256679Z self = 2025-05-07T20:32:33.2257491Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.2257945Z 2025-05-07T20:32:33.2258059Z @given( 2025-05-07T20:32:33.2258463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.2258956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.2259429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.2259957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.2260480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.2260974Z ) 2025-05-07T20:32:33.2261539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.2262267Z def test_silu_mul_quant( 2025-05-07T20:32:33.2262637Z self, 2025-05-07T20:32:33.2262932Z T: int, 2025-05-07T20:32:33.2263235Z D: int, 2025-05-07T20:32:33.2263560Z scale_ub: Optional[float], 2025-05-07T20:32:33.2263983Z contiguous: bool, 2025-05-07T20:32:33.2264355Z compiled: bool, 2025-05-07T20:32:33.2264693Z ) -> None: 2025-05-07T20:32:33.2265025Z torch.manual_seed(2025) 2025-05-07T20:32:33.2265406Z 2025-05-07T20:32:33.2265822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.2266362Z 2025-05-07T20:32:33.2266654Z x_sign = torch.sign(x) 2025-05-07T20:32:33.2267104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.2267584Z x = x_sign * x_clamp 2025-05-07T20:32:33.2267962Z x0 = x[:, :D] 2025-05-07T20:32:33.2268291Z x1 = x[:, D:] 2025-05-07T20:32:33.2268599Z 2025-05-07T20:32:33.2268876Z if contiguous: 2025-05-07T20:32:33.2269228Z x0 = x0.contiguous() 2025-05-07T20:32:33.2269626Z x1 = x1.contiguous() 2025-05-07T20:32:33.2270001Z 2025-05-07T20:32:33.2270294Z if scale_ub is not None: 2025-05-07T20:32:33.2270728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.2281824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.2282320Z ) 2025-05-07T20:32:33.2282613Z else: 2025-05-07T20:32:33.2282933Z scale_ub_tensor = None 2025-05-07T20:32:33.2283331Z 2025-05-07T20:32:33.2283693Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.2284185Z op = silu_mul_quant 2025-05-07T20:32:33.2284574Z if compiled: 2025-05-07T20:32:33.2284953Z op = torch.compile(op) 2025-05-07T20:32:33.2285415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2285847Z 2025-05-07T20:32:33.2286137Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.2286396Z 2025-05-07T20:32:33.2286547Z moe/activation_test.py:117: 2025-05-07T20:32:33.2287092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2287629Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.2288070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2288985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.2289928Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.2291036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.2292270Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2293153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2294301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2295417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2296295Z kernel = self.compile( 2025-05-07T20:32:33.2297247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2298341Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2298970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2299400Z 2025-05-07T20:32:33.2299725Z self = 2025-05-07T20:32:33.2301553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2303974Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b45440>} 2025-05-07T20:32:33.2306546Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2308281Z context = 2025-05-07T20:32:33.2308768Z 2025-05-07T20:32:33.2309028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2309885Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2310612Z module_map=module_map) 2025-05-07T20:32:33.2311144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2311672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2312052Z E ^ 2025-05-07T20:32:33.2312752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.2313472Z 2025-05-07T20:32:33.2314134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.2315002Z 2025-05-07T20:32:33.3867545Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3868364Z self=, 2025-05-07T20:32:33.3869115Z T=16384, 2025-05-07T20:32:33.3869429Z D=5120, 2025-05-07T20:32:33.3869750Z scale_ub=1200.0, 2025-05-07T20:32:33.3870116Z contiguous=False, 2025-05-07T20:32:33.3870472Z compiled=False, 2025-05-07T20:32:33.3870810Z ) 2025-05-07T20:32:33.3871359Z self = 2025-05-07T20:32:33.3872269Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.3872780Z 2025-05-07T20:32:33.3872904Z @given( 2025-05-07T20:32:33.3873607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.3874164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.3874683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.3875266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.3875839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.3876326Z ) 2025-05-07T20:32:33.3876942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.3877750Z def test_silu_mul_quant( 2025-05-07T20:32:33.3878142Z self, 2025-05-07T20:32:33.3878462Z T: int, 2025-05-07T20:32:33.3878769Z D: int, 2025-05-07T20:32:33.3879100Z scale_ub: Optional[float], 2025-05-07T20:32:33.3879517Z contiguous: bool, 2025-05-07T20:32:33.3879898Z compiled: bool, 2025-05-07T20:32:33.3880258Z ) -> None: 2025-05-07T20:32:33.3880585Z torch.manual_seed(2025) 2025-05-07T20:32:33.3880968Z 2025-05-07T20:32:33.3881406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.3882010Z 2025-05-07T20:32:33.3882360Z x_sign = torch.sign(x) 2025-05-07T20:32:33.3882978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.3883506Z x = x_sign * x_clamp 2025-05-07T20:32:33.3883903Z x0 = x[:, :D] 2025-05-07T20:32:33.3884255Z x1 = x[:, D:] 2025-05-07T20:32:33.3884699Z 2025-05-07T20:32:33.3884995Z if contiguous: 2025-05-07T20:32:33.3885376Z x0 = x0.contiguous() 2025-05-07T20:32:33.3885801Z x1 = x1.contiguous() 2025-05-07T20:32:33.3886199Z 2025-05-07T20:32:33.3886515Z if scale_ub is not None: 2025-05-07T20:32:33.3886971Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.3887655Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.3888190Z ) 2025-05-07T20:32:33.3888505Z else: 2025-05-07T20:32:33.3888843Z scale_ub_tensor = None 2025-05-07T20:32:33.3889272Z 2025-05-07T20:32:33.3889654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.3890195Z op = silu_mul_quant 2025-05-07T20:32:33.3890624Z if compiled: 2025-05-07T20:32:33.3891046Z op = torch.compile(op) 2025-05-07T20:32:33.3891567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3892156Z 2025-05-07T20:32:33.3892481Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3892763Z 2025-05-07T20:32:33.3892924Z moe/activation_test.py:117: 2025-05-07T20:32:33.3893431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3894011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3894490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3895777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:33.3897081Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3898069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3899343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3900566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3901338Z kernel = self.compile( 2025-05-07T20:32:33.3902085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3903008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3903563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3903889Z 2025-05-07T20:32:33.3904177Z self = 2025-05-07T20:32:33.3905848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3908351Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b46340>} 2025-05-07T20:32:33.3910410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3911935Z context = 2025-05-07T20:32:33.3912421Z 2025-05-07T20:32:33.3912698Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3913497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3914271Z module_map=module_map) 2025-05-07T20:32:33.3914863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3915573Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3915992Z E ^ 2025-05-07T20:32:33.3916758Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3917653Z 2025-05-07T20:32:33.3918439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3919407Z 2025-05-07T20:32:33.3919578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3920308Z self=, 2025-05-07T20:32:33.3921138Z T=16384, 2025-05-07T20:32:33.3921448Z D=5120, 2025-05-07T20:32:33.3921761Z scale_ub=1200.0, 2025-05-07T20:32:33.3922131Z contiguous=True, 2025-05-07T20:32:33.3922497Z compiled=True, 2025-05-07T20:32:33.3922826Z ) 2025-05-07T20:32:33.3923373Z self = 2025-05-07T20:32:33.3924270Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:33.3924768Z 2025-05-07T20:32:33.3924893Z @given( 2025-05-07T20:32:33.3925275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.3925818Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.3926339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.3926909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.3927478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.3927971Z ) 2025-05-07T20:32:33.3928574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.3929374Z def test_silu_mul_quant( 2025-05-07T20:32:33.3929780Z self, 2025-05-07T20:32:33.3930091Z T: int, 2025-05-07T20:32:33.3930410Z D: int, 2025-05-07T20:32:33.3930771Z scale_ub: Optional[float], 2025-05-07T20:32:33.3931227Z contiguous: bool, 2025-05-07T20:32:33.3931627Z compiled: bool, 2025-05-07T20:32:33.3932128Z ) -> None: 2025-05-07T20:32:33.3932474Z torch.manual_seed(2025) 2025-05-07T20:32:33.3932883Z 2025-05-07T20:32:33.3933337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.3933930Z 2025-05-07T20:32:33.3934243Z x_sign = torch.sign(x) 2025-05-07T20:32:33.3934733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.3935265Z x = x_sign * x_clamp 2025-05-07T20:32:33.3935660Z x0 = x[:, :D] 2025-05-07T20:32:33.3936014Z x1 = x[:, D:] 2025-05-07T20:32:33.3936364Z 2025-05-07T20:32:33.3936655Z if contiguous: 2025-05-07T20:32:33.3937038Z x0 = x0.contiguous() 2025-05-07T20:32:33.3937473Z x1 = x1.contiguous() 2025-05-07T20:32:33.3937899Z 2025-05-07T20:32:33.3938330Z if scale_ub is not None: 2025-05-07T20:32:33.3938800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.3939381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.3939898Z ) 2025-05-07T20:32:33.3940219Z else: 2025-05-07T20:32:33.3940561Z scale_ub_tensor = None 2025-05-07T20:32:33.3940984Z 2025-05-07T20:32:33.3941361Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.3941906Z op = silu_mul_quant 2025-05-07T20:32:33.3942314Z if compiled: 2025-05-07T20:32:33.3942723Z op = torch.compile(op) 2025-05-07T20:32:33.3943222Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3943686Z 2025-05-07T20:32:33.3943999Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3944278Z 2025-05-07T20:32:33.3944447Z moe/activation_test.py:117: 2025-05-07T20:32:33.3944939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3945516Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3945999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3947084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.3948108Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.3949315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.3950695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3951678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3952933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3954215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3955200Z kernel = self.compile( 2025-05-07T20:32:33.3956182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3957393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3958086Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3958505Z 2025-05-07T20:32:33.3958858Z self = 2025-05-07T20:32:33.3960884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3963528Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b479c0>} 2025-05-07T20:32:33.3966091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3968031Z context = 2025-05-07T20:32:33.3968563Z 2025-05-07T20:32:33.3968846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3969787Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3970620Z module_map=module_map) 2025-05-07T20:32:33.3971249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3971945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3972394Z E ^ 2025-05-07T20:32:33.3973221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3974082Z 2025-05-07T20:32:33.3974927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3975902Z 2025-05-07T20:32:33.5672070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5672859Z self=, 2025-05-07T20:32:33.5673546Z T=16384, 2025-05-07T20:32:33.5673842Z D=5120, 2025-05-07T20:32:33.5674139Z scale_ub=None, 2025-05-07T20:32:33.5674481Z contiguous=False, 2025-05-07T20:32:33.5674822Z compiled=True, 2025-05-07T20:32:33.5675142Z ) 2025-05-07T20:32:33.5675656Z self = 2025-05-07T20:32:33.5676474Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.5676956Z 2025-05-07T20:32:33.5677066Z @given( 2025-05-07T20:32:33.5677404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.5677894Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.5678402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.5678942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.5679775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.5680240Z ) 2025-05-07T20:32:33.5680821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.5681579Z def test_silu_mul_quant( 2025-05-07T20:32:33.5682086Z self, 2025-05-07T20:32:33.5682385Z T: int, 2025-05-07T20:32:33.5682690Z D: int, 2025-05-07T20:32:33.5683017Z scale_ub: Optional[float], 2025-05-07T20:32:33.5683457Z contiguous: bool, 2025-05-07T20:32:33.5683849Z compiled: bool, 2025-05-07T20:32:33.5684215Z ) -> None: 2025-05-07T20:32:33.5684684Z torch.manual_seed(2025) 2025-05-07T20:32:33.5685072Z 2025-05-07T20:32:33.5685494Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.5686057Z 2025-05-07T20:32:33.5686363Z x_sign = torch.sign(x) 2025-05-07T20:32:33.5686813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.5687321Z x = x_sign * x_clamp 2025-05-07T20:32:33.5687703Z x0 = x[:, :D] 2025-05-07T20:32:33.5688027Z x1 = x[:, D:] 2025-05-07T20:32:33.5688352Z 2025-05-07T20:32:33.5688632Z if contiguous: 2025-05-07T20:32:33.5688988Z x0 = x0.contiguous() 2025-05-07T20:32:33.5689388Z x1 = x1.contiguous() 2025-05-07T20:32:33.5689767Z 2025-05-07T20:32:33.5690065Z if scale_ub is not None: 2025-05-07T20:32:33.5690494Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.5691062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.5691581Z ) 2025-05-07T20:32:33.5691974Z else: 2025-05-07T20:32:33.5692305Z scale_ub_tensor = None 2025-05-07T20:32:33.5692670Z 2025-05-07T20:32:33.5692972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.5693412Z op = silu_mul_quant 2025-05-07T20:32:33.5693754Z if compiled: 2025-05-07T20:32:33.5694076Z op = torch.compile(op) 2025-05-07T20:32:33.5694485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5694863Z 2025-05-07T20:32:33.5695117Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.5695375Z 2025-05-07T20:32:33.5695514Z moe/activation_test.py:117: 2025-05-07T20:32:33.5695943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5696439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.5696854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5697768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.5698635Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.5699838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.5700948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.5701855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.5702991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.5704193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.5705165Z kernel = self.compile( 2025-05-07T20:32:33.5706446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.5707646Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.5708337Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5708761Z 2025-05-07T20:32:33.5709104Z self = 2025-05-07T20:32:33.5711046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.5713522Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7cc20>} 2025-05-07T20:32:33.5715944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.5717730Z context = 2025-05-07T20:32:33.5718365Z 2025-05-07T20:32:33.5718633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.5719503Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.5720332Z module_map=module_map) 2025-05-07T20:32:33.5720904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.5721439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.5721807Z E ^ 2025-05-07T20:32:33.5722595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.5723455Z 2025-05-07T20:32:33.5724233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.5725213Z 2025-05-07T20:32:33.5725369Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5726088Z self=, 2025-05-07T20:32:33.5726803Z T=2048, 2025-05-07T20:32:33.5727110Z D=5120, 2025-05-07T20:32:33.5727420Z scale_ub=None, 2025-05-07T20:32:33.5727771Z contiguous=False, 2025-05-07T20:32:33.5728148Z compiled=True, 2025-05-07T20:32:33.5728482Z ) 2025-05-07T20:32:33.6629444Z self = 2025-05-07T20:32:33.6630462Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.6630960Z 2025-05-07T20:32:33.6631088Z @given( 2025-05-07T20:32:33.6631463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.6632021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.6632528Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.6633098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.6633671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.6634166Z ) 2025-05-07T20:32:33.6634770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.6635557Z def test_silu_mul_quant( 2025-05-07T20:32:33.6635964Z self, 2025-05-07T20:32:33.6636598Z T: int, 2025-05-07T20:32:33.6636929Z D: int, 2025-05-07T20:32:33.6637294Z scale_ub: Optional[float], 2025-05-07T20:32:33.6637743Z contiguous: bool, 2025-05-07T20:32:33.6638153Z compiled: bool, 2025-05-07T20:32:33.6638527Z ) -> None: 2025-05-07T20:32:33.6638873Z torch.manual_seed(2025) 2025-05-07T20:32:33.6639284Z 2025-05-07T20:32:33.6639742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.6640330Z 2025-05-07T20:32:33.6640633Z x_sign = torch.sign(x) 2025-05-07T20:32:33.6641095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.6641585Z x = x_sign * x_clamp 2025-05-07T20:32:33.6641956Z x0 = x[:, :D] 2025-05-07T20:32:33.6642298Z x1 = x[:, D:] 2025-05-07T20:32:33.6642626Z 2025-05-07T20:32:33.6642908Z if contiguous: 2025-05-07T20:32:33.6643271Z x0 = x0.contiguous() 2025-05-07T20:32:33.6643704Z x1 = x1.contiguous() 2025-05-07T20:32:33.6644108Z 2025-05-07T20:32:33.6644420Z if scale_ub is not None: 2025-05-07T20:32:33.6644876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.6645571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.6646106Z ) 2025-05-07T20:32:33.6646417Z else: 2025-05-07T20:32:33.6646865Z scale_ub_tensor = None 2025-05-07T20:32:33.6647408Z 2025-05-07T20:32:33.6647783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.6648313Z op = silu_mul_quant 2025-05-07T20:32:33.6648733Z if compiled: 2025-05-07T20:32:33.6649142Z op = torch.compile(op) 2025-05-07T20:32:33.6649632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6650233Z 2025-05-07T20:32:33.6650546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.6650827Z 2025-05-07T20:32:33.6651000Z moe/activation_test.py:117: 2025-05-07T20:32:33.6651504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6652195Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.6652693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6653713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.6654752Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.6655980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.6657269Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.6658244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.6659518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.6660756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.6661738Z kernel = self.compile( 2025-05-07T20:32:33.6662692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.6663646Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.6664189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6664515Z 2025-05-07T20:32:33.6664788Z self = 2025-05-07T20:32:33.6666309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.6668380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7d9e0>} 2025-05-07T20:32:33.6670541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.6672105Z context = 2025-05-07T20:32:33.6672521Z 2025-05-07T20:32:33.6672762Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.6673537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.6674264Z module_map=module_map) 2025-05-07T20:32:33.6685329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.6685935Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.6686341Z E ^ 2025-05-07T20:32:33.6687108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.6687881Z 2025-05-07T20:32:33.6688528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.6689322Z 2025-05-07T20:32:33.6689590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.6690264Z self=, 2025-05-07T20:32:33.6690922Z T=2048, 2025-05-07T20:32:33.6691280Z D=5120, 2025-05-07T20:32:33.6691575Z scale_ub=1200.0, 2025-05-07T20:32:33.6692076Z contiguous=False, 2025-05-07T20:32:33.6692414Z compiled=True, 2025-05-07T20:32:33.6692719Z ) 2025-05-07T20:32:33.6693232Z self = 2025-05-07T20:32:33.6694104Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.6694701Z 2025-05-07T20:32:33.6694825Z @given( 2025-05-07T20:32:33.6695200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.6695751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.6696267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.6696852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.6697412Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.6697899Z ) 2025-05-07T20:32:33.6698510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.6699307Z def test_silu_mul_quant( 2025-05-07T20:32:33.6699705Z self, 2025-05-07T20:32:33.6700025Z T: int, 2025-05-07T20:32:33.6700352Z D: int, 2025-05-07T20:32:33.6700704Z scale_ub: Optional[float], 2025-05-07T20:32:33.6701163Z contiguous: bool, 2025-05-07T20:32:33.6701563Z compiled: bool, 2025-05-07T20:32:33.6701942Z ) -> None: 2025-05-07T20:32:33.6702286Z torch.manual_seed(2025) 2025-05-07T20:32:33.6702693Z 2025-05-07T20:32:33.6703146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.6703744Z 2025-05-07T20:32:33.6704061Z x_sign = torch.sign(x) 2025-05-07T20:32:33.6704551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.6705082Z x = x_sign * x_clamp 2025-05-07T20:32:33.6705490Z x0 = x[:, :D] 2025-05-07T20:32:33.6705846Z x1 = x[:, D:] 2025-05-07T20:32:33.6706535Z 2025-05-07T20:32:33.6706859Z if contiguous: 2025-05-07T20:32:33.6707259Z x0 = x0.contiguous() 2025-05-07T20:32:33.6707687Z x1 = x1.contiguous() 2025-05-07T20:32:33.6708095Z 2025-05-07T20:32:33.6708405Z if scale_ub is not None: 2025-05-07T20:32:33.6708856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.6709433Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.6709961Z ) 2025-05-07T20:32:33.6710280Z else: 2025-05-07T20:32:33.6710621Z scale_ub_tensor = None 2025-05-07T20:32:33.6711045Z 2025-05-07T20:32:33.6711558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.6712099Z op = silu_mul_quant 2025-05-07T20:32:33.6712522Z if compiled: 2025-05-07T20:32:33.6712940Z op = torch.compile(op) 2025-05-07T20:32:33.6713435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6713910Z 2025-05-07T20:32:33.6714223Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.6714506Z 2025-05-07T20:32:33.6714668Z moe/activation_test.py:117: 2025-05-07T20:32:33.6715168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6715748Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.6716216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6717241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.6718295Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.6719531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.6720816Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.6721892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.6723210Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.6724537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.6725510Z kernel = self.compile( 2025-05-07T20:32:33.6726493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.6727807Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.6728499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6728924Z 2025-05-07T20:32:33.6729283Z self = 2025-05-07T20:32:33.6731332Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.6734079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7eb60>} 2025-05-07T20:32:33.6736657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.6738587Z context = 2025-05-07T20:32:33.6739117Z 2025-05-07T20:32:33.6739399Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.6740335Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.6741185Z module_map=module_map) 2025-05-07T20:32:33.6741808Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.6742412Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.6742857Z E ^ 2025-05-07T20:32:33.6743679Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.6744544Z 2025-05-07T20:32:33.6745317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.6746298Z 2025-05-07T20:32:33.8482455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.8483279Z self=, 2025-05-07T20:32:33.8483706Z T=4096, 2025-05-07T20:32:33.8483894Z D=5120, 2025-05-07T20:32:33.8484363Z scale_ub=1200.0, 2025-05-07T20:32:33.8484598Z contiguous=True, 2025-05-07T20:32:33.8484815Z compiled=True, 2025-05-07T20:32:33.8485028Z ) 2025-05-07T20:32:33.8485363Z self = 2025-05-07T20:32:33.8485871Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:33.8486162Z 2025-05-07T20:32:33.8486247Z @given( 2025-05-07T20:32:33.8486483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.8486796Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.8487115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.8487453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.8487797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.8488084Z ) 2025-05-07T20:32:33.8488444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.8488896Z def test_silu_mul_quant( 2025-05-07T20:32:33.8489142Z self, 2025-05-07T20:32:33.8489347Z T: int, 2025-05-07T20:32:33.8489548Z D: int, 2025-05-07T20:32:33.8489861Z scale_ub: Optional[float], 2025-05-07T20:32:33.8490147Z contiguous: bool, 2025-05-07T20:32:33.8490392Z compiled: bool, 2025-05-07T20:32:33.8490622Z ) -> None: 2025-05-07T20:32:33.8490845Z torch.manual_seed(2025) 2025-05-07T20:32:33.8491172Z 2025-05-07T20:32:33.8491445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.8491795Z 2025-05-07T20:32:33.8492080Z x_sign = torch.sign(x) 2025-05-07T20:32:33.8492420Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.8492741Z x = x_sign * x_clamp 2025-05-07T20:32:33.8493075Z x0 = x[:, :D] 2025-05-07T20:32:33.8493299Z x1 = x[:, D:] 2025-05-07T20:32:33.8493508Z 2025-05-07T20:32:33.8493697Z if contiguous: 2025-05-07T20:32:33.8493936Z x0 = x0.contiguous() 2025-05-07T20:32:33.8494194Z x1 = x1.contiguous() 2025-05-07T20:32:33.8494437Z 2025-05-07T20:32:33.8494628Z if scale_ub is not None: 2025-05-07T20:32:33.8494903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.8495244Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.8495563Z ) 2025-05-07T20:32:33.8495753Z else: 2025-05-07T20:32:33.8495971Z scale_ub_tensor = None 2025-05-07T20:32:33.8496229Z 2025-05-07T20:32:33.8496459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.8496782Z op = silu_mul_quant 2025-05-07T20:32:33.8497038Z if compiled: 2025-05-07T20:32:33.8497284Z op = torch.compile(op) 2025-05-07T20:32:33.8497588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8497869Z 2025-05-07T20:32:33.8498070Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.8498236Z 2025-05-07T20:32:33.8498339Z moe/activation_test.py:117: 2025-05-07T20:32:33.8498642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8498984Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.8499268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8499843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.8500424Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.8501097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.8501806Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.8502358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.8503060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.8503789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.8504339Z kernel = self.compile( 2025-05-07T20:32:33.8504901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.8505576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.8505976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8506767Z 2025-05-07T20:32:33.8507011Z self = 2025-05-07T20:32:33.8508140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.8509596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917880180>} 2025-05-07T20:32:33.8511114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.8512183Z context = 2025-05-07T20:32:33.8512549Z 2025-05-07T20:32:33.8512719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.8513259Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.8513731Z module_map=module_map) 2025-05-07T20:32:33.8514104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.8514536Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.8514801Z E ^ 2025-05-07T20:32:33.8515283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.8515756Z 2025-05-07T20:32:33.8516201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.8516729Z 2025-05-07T20:32:33.8516842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.8517268Z self=, 2025-05-07T20:32:33.8517684Z T=128, 2025-05-07T20:32:33.8517879Z D=5120, 2025-05-07T20:32:33.8518075Z scale_ub=1200.0, 2025-05-07T20:32:33.8518299Z contiguous=False, 2025-05-07T20:32:33.8518528Z compiled=True, 2025-05-07T20:32:33.8518734Z ) 2025-05-07T20:32:34.1263978Z self = 2025-05-07T20:32:34.1264595Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.1264876Z 2025-05-07T20:32:34.1264961Z @given( 2025-05-07T20:32:34.1265205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1265525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1265836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1266170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1266501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1266787Z ) 2025-05-07T20:32:34.1267140Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1267609Z def test_silu_mul_quant( 2025-05-07T20:32:34.1267860Z self, 2025-05-07T20:32:34.1268055Z T: int, 2025-05-07T20:32:34.1268260Z D: int, 2025-05-07T20:32:34.1268495Z scale_ub: Optional[float], 2025-05-07T20:32:34.1268777Z contiguous: bool, 2025-05-07T20:32:34.1269027Z compiled: bool, 2025-05-07T20:32:34.1269270Z ) -> None: 2025-05-07T20:32:34.1269486Z torch.manual_seed(2025) 2025-05-07T20:32:34.1269741Z 2025-05-07T20:32:34.1270356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1270716Z 2025-05-07T20:32:34.1270911Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1271219Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1271538Z x = x_sign * x_clamp 2025-05-07T20:32:34.1271790Z x0 = x[:, :D] 2025-05-07T20:32:34.1272024Z x1 = x[:, D:] 2025-05-07T20:32:34.1272245Z 2025-05-07T20:32:34.1272432Z if contiguous: 2025-05-07T20:32:34.1272674Z x0 = x0.contiguous() 2025-05-07T20:32:34.1272937Z x1 = x1.contiguous() 2025-05-07T20:32:34.1273178Z 2025-05-07T20:32:34.1273374Z if scale_ub is not None: 2025-05-07T20:32:34.1273656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.1274001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.1274322Z ) 2025-05-07T20:32:34.1274522Z else: 2025-05-07T20:32:34.1274735Z scale_ub_tensor = None 2025-05-07T20:32:34.1274995Z 2025-05-07T20:32:34.1275232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.1275560Z op = silu_mul_quant 2025-05-07T20:32:34.1275902Z if compiled: 2025-05-07T20:32:34.1276160Z op = torch.compile(op) 2025-05-07T20:32:34.1276463Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1276818Z 2025-05-07T20:32:34.1277015Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.1277184Z 2025-05-07T20:32:34.1277293Z moe/activation_test.py:117: 2025-05-07T20:32:34.1277595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1277942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.1278313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1278892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.1279481Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.1280175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:34.1280895Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:34.1281440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:34.1282158Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:34.1282890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:34.1283447Z     kernel = self.compile(
2025-05-07T20:32:34.1284005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:34.1284692Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:34.1285108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.1285345Z 
2025-05-07T20:32:34.1285556Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:34.1286678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:34.1288133Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f8917880ea0>}
2025-05-07T20:32:34.1289525Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:34.1290599Z context = <...>
2025-05-07T20:32:34.1290892Z 
2025-05-07T20:32:34.1291113Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.1291652Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.1292208Z                            module_map=module_map)
2025-05-07T20:32:34.1292586Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.1292944Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.1293217Z E       ^
2025-05-07T20:32:34.1293694Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.1294158Z 
2025-05-07T20:32:34.1294586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.1295125Z 
2025-05-07T20:32:34.1295230Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:34.1295659Z     self=<...>,
2025-05-07T20:32:34.1296073Z     T=16384,
2025-05-07T20:32:34.1296265Z     D=7168,
2025-05-07T20:32:34.1296463Z     scale_ub=1200.0,
2025-05-07T20:32:34.1296690Z     contiguous=True,
2025-05-07T20:32:34.1296906Z     compiled=True,
2025-05-07T20:32:34.1297171Z )
2025-05-07T20:32:34.1297500Z self = <...>
2025-05-07T20:32:34.1298008Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:34.1298340Z 
2025-05-07T20:32:34.1298419Z     @given(
2025-05-07T20:32:34.1298657Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:34.1298971Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:34.1299282Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:34.1299619Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:34.1300004Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:34.1300288Z     )
2025-05-07T20:32:34.1300648Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:34.1301107Z     def test_silu_mul_quant(
2025-05-07T20:32:34.1301349Z         self,
2025-05-07T20:32:34.1301554Z         T: int,
2025-05-07T20:32:34.1301758Z         D: int,
2025-05-07T20:32:34.1301974Z         scale_ub: Optional[float],
2025-05-07T20:32:34.1302252Z         contiguous: bool,
2025-05-07T20:32:34.1302497Z         compiled: bool,
2025-05-07T20:32:34.1302722Z     ) -> None:
2025-05-07T20:32:34.1302951Z         torch.manual_seed(2025)
2025-05-07T20:32:34.1303227Z 
2025-05-07T20:32:34.1303516Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:34.1303873Z 
2025-05-07T20:32:34.1304075Z         x_sign = torch.sign(x)
2025-05-07T20:32:34.1304377Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.1304700Z         x = x_sign * x_clamp
2025-05-07T20:32:34.1304949Z         x0 = x[:, :D]
2025-05-07T20:32:34.1305177Z         x1 = x[:, D:]
2025-05-07T20:32:34.1305388Z 
2025-05-07T20:32:34.1305585Z         if contiguous:
2025-05-07T20:32:34.1305823Z             x0 = x0.contiguous()
2025-05-07T20:32:34.1306082Z             x1 = x1.contiguous()
2025-05-07T20:32:34.1306591Z 
2025-05-07T20:32:34.1306791Z         if scale_ub is not None:
2025-05-07T20:32:34.1307064Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:34.1307407Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:34.1307737Z             )
2025-05-07T20:32:34.1307937Z         else:
2025-05-07T20:32:34.1308162Z             scale_ub_tensor = None
2025-05-07T20:32:34.1308426Z 
2025-05-07T20:32:34.1308658Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:34.1308985Z             op = silu_mul_quant
2025-05-07T20:32:34.1309247Z             if compiled:
2025-05-07T20:32:34.1309499Z                 op = torch.compile(op)
2025-05-07T20:32:34.1309805Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:34.1310089Z 
2025-05-07T20:32:34.1310290Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:34.1310539Z 
2025-05-07T20:32:34.1310643Z moe/activation_test.py:117: 
2025-05-07T20:32:34.1310948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.1311291Z moe/activation_test.py:115: in fn
2025-05-07T20:32:34.1311575Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:34.1312153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:34.1312738Z     return fn(*args, **kwargs)
2025-05-07T20:32:34.1313419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:34.1314127Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:34.1314685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:34.1315397Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:34.1316079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:34.1316637Z     kernel = self.compile(
2025-05-07T20:32:34.1317310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:34.1317994Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:34.1318506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.1318757Z 
2025-05-07T20:32:34.1318973Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:34.1320098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:34.1321584Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89178820c0>}
2025-05-07T20:32:34.1323027Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:34.1324095Z context = <...>
2025-05-07T20:32:34.1324399Z 
2025-05-07T20:32:34.1324570Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.1325115Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.1325596Z                            module_map=module_map)
2025-05-07T20:32:34.1325979Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.1326347Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.1326618Z E       ^
2025-05-07T20:32:34.1327097Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.1327570Z 
2025-05-07T20:32:34.1328003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.1328532Z 
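Every Hypothesis example in this job dies at the same spot: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype used by _fbgemm_silu_mul_quant. fp8e4nv needs a GPU with compute capability 8.9 or newer (Ada/Hopper); the ~22 GiB card on this runner is an A10G-class sm_86 part, which only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate that would skip rather than fail these cases on older GPUs; the helper name and the skip placement are illustrative assumptions, not the actual structure of moe/activation_test.py:

    import unittest

    import torch


    def fp8e4nv_supported() -> bool:
        # Illustrative helper: Triton lowers fp8e4nv (E4M3) only on GPUs
        # with compute capability 8.9 or newer (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical placement: gate the fp8 test class on the check above.
    @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv requires sm_89+")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With a gate like this the run would report skips on sm_86 runners instead of burning every Hypothesis example on the same compile failure, as happens below.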
2025-05-07T20:32:34.2551239Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:34.2596344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.2596990Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.2628370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.4338764Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.4371030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.4371663Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.5312573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.5313220Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:34.5353776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.5981850Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:34.5982303Z     self=<...>,
2025-05-07T20:32:34.5982738Z     T=16384,
2025-05-07T20:32:34.5982934Z     D=5120,
2025-05-07T20:32:34.5983128Z     scale_ub=None,
2025-05-07T20:32:34.5983341Z     contiguous=False,
2025-05-07T20:32:34.5983560Z     compiled=False,
2025-05-07T20:32:34.5983999Z )
2025-05-07T20:32:34.5984332Z self = <...>
2025-05-07T20:32:34.5984855Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:34.5985148Z 
2025-05-07T20:32:34.5985228Z     @given(
2025-05-07T20:32:34.5985467Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:34.5985791Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:34.5986098Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:34.5986437Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:34.5986775Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:34.5987061Z     )
2025-05-07T20:32:34.5987426Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:34.5987880Z     def test_silu_mul_quant(
2025-05-07T20:32:34.5988121Z         self,
2025-05-07T20:32:34.5988326Z         T: int,
2025-05-07T20:32:34.5988532Z         D: int,
2025-05-07T20:32:34.5988761Z         scale_ub: Optional[float],
2025-05-07T20:32:34.5989033Z         contiguous: bool,
2025-05-07T20:32:34.5989282Z         compiled: bool,
2025-05-07T20:32:34.5989603Z     ) -> None:
2025-05-07T20:32:34.5989825Z         torch.manual_seed(2025)
2025-05-07T20:32:34.5990078Z 
2025-05-07T20:32:34.5990358Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:34.5990771Z 
2025-05-07T20:32:34.5990970Z         x_sign = torch.sign(x)
2025-05-07T20:32:34.5991266Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.5993374Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.5995420Z 
2025-05-07T20:32:34.5995551Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:34.5995770Z 
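The failed allocation sizes are exactly what the test's shapes predict: every intermediate here (torch.abs(x), x_clamp, and so on) is a [T, 2*D] bfloat16 tensor at 2 bytes per element, so the T=16384, D=5120 example above needs 16384 * 10240 * 2 B = 320 MiB for one more copy, precisely the amount the allocator could not find. A back-of-the-envelope check of the other figures in this log, assuming nothing beyond those shapes:

    # Each intermediate in test_silu_mul_quant is a [T, 2*D] bf16 tensor,
    # i.e. T * (2*D) * 2 bytes. Compare with the "Tried to allocate" sizes.
    MIB = 1 << 20


    def intermediate_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / MIB


    assert intermediate_mib(16384, 5120) == 320.0  # x_clamp failure above
    assert intermediate_mib(4096, 7168) == 112.0   # x_clamp failure below
    assert intermediate_mib(16384, 7168) == 448.0  # torch.randn failure below
    assert intermediate_mib(2048, 7168) == 56.0    # the 56 MiB failures below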
2025-05-07T20:32:34.5995874Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:34.6005099Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.6007498Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.6009600Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:34.6009932Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.6018345Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:34.6020501Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.6022637Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:34.6022966Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:34.6032039Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.6034118Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.6036223Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:34.6036542Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:34.7167993Z >       x_sign = torch.sign(x)
2025-05-07T20:32:34.7170038Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.7172200Z moe/activation_test.py:94: OutOfMemoryError
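By this point the process already holds roughly 22 GiB of the card's 22.07 GiB, so even 56 MiB requests fail: tensors kept alive across earlier Hypothesis examples plus allocator fragmentation leave almost nothing free. The error text itself suggests expandable segments, and releasing the cache between examples would also help. A sketch of both mitigations, assuming the job's environment can be edited and a per-example cleanup hook can be added (the hook name is illustrative):

    import os

    # Mitigation 1: must be set before the first CUDA allocation, so in
    # practice it belongs in the CI job's environment, not in test code.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc

    import torch


    def free_cuda_between_examples() -> None:
        # Mitigation 2: drop tensors kept alive by the previous example and
        # return cached segments to the driver before the next one runs.
        gc.collect()
        torch.cuda.empty_cache()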
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7172080Z 2025-05-07T20:32:34.7172200Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.7172422Z 2025-05-07T20:32:34.7172523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7172948Z self=, 2025-05-07T20:32:34.7173356Z T=1, 2025-05-07T20:32:34.7173542Z D=7168, 2025-05-07T20:32:34.7173813Z scale_ub=1200.0, 2025-05-07T20:32:34.7174055Z contiguous=True, 2025-05-07T20:32:34.7174272Z compiled=False, 2025-05-07T20:32:34.7174483Z ) 2025-05-07T20:32:34.7174809Z self = 2025-05-07T20:32:34.7175413Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.7175687Z 2025-05-07T20:32:34.7175765Z @given( 2025-05-07T20:32:34.7175996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7176313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7176696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7177031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7177367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7177650Z ) 2025-05-07T20:32:34.7178008Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7178464Z def test_silu_mul_quant( 2025-05-07T20:32:34.7178708Z self, 2025-05-07T20:32:34.7178901Z T: int, 2025-05-07T20:32:34.7179100Z D: int, 2025-05-07T20:32:34.7179321Z scale_ub: Optional[float], 2025-05-07T20:32:34.7179593Z contiguous: bool, 2025-05-07T20:32:34.7179837Z compiled: bool, 2025-05-07T20:32:34.7180055Z ) -> None: 2025-05-07T20:32:34.7180266Z torch.manual_seed(2025) 2025-05-07T20:32:34.7180510Z 2025-05-07T20:32:34.7180780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7181119Z 2025-05-07T20:32:34.7181313Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7181606Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7181915Z x = x_sign * x_clamp 2025-05-07T20:32:34.7182163Z x0 = x[:, :D] 2025-05-07T20:32:34.7182380Z x1 = x[:, D:] 2025-05-07T20:32:34.7182582Z 2025-05-07T20:32:34.7182766Z if contiguous: 2025-05-07T20:32:34.7183003Z x0 = x0.contiguous() 2025-05-07T20:32:34.7183260Z x1 = x1.contiguous() 2025-05-07T20:32:34.7183507Z 2025-05-07T20:32:34.7183701Z if scale_ub is not None: 2025-05-07T20:32:34.7183974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7184308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7184624Z ) 2025-05-07T20:32:34.7184823Z else: 2025-05-07T20:32:34.7185036Z scale_ub_tensor = None 2025-05-07T20:32:34.7185293Z 2025-05-07T20:32:34.7185534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7185859Z op = silu_mul_quant 2025-05-07T20:32:34.7186117Z if compiled: 2025-05-07T20:32:34.7186375Z op = torch.compile(op) 2025-05-07T20:32:34.7186726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7187015Z 2025-05-07T20:32:34.7187213Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7187382Z 2025-05-07T20:32:34.7187486Z moe/activation_test.py:117: 2025-05-07T20:32:34.7187792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7188136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7188429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7189146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7189866Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7190426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7191136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7191831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7192387Z kernel = self.compile( 2025-05-07T20:32:34.7192995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7193673Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7194085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7194364Z 2025-05-07T20:32:34.7194582Z self = 2025-05-07T20:32:34.7195705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7197176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917552520>} 2025-05-07T20:32:34.7198582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7199650Z context = 2025-05-07T20:32:34.7199949Z 2025-05-07T20:32:34.7200126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7200660Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7201143Z module_map=module_map) 2025-05-07T20:32:34.7201520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7201889Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7202148Z E ^ 2025-05-07T20:32:34.7202630Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7203096Z 2025-05-07T20:32:34.7203543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7204078Z 2025-05-07T20:32:34.7204188Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7204611Z self=, 2025-05-07T20:32:34.7205031Z T=128, 2025-05-07T20:32:34.7205225Z D=5120, 2025-05-07T20:32:34.7205415Z scale_ub=None, 2025-05-07T20:32:34.7205633Z contiguous=True, 2025-05-07T20:32:34.7205867Z compiled=False, 2025-05-07T20:32:34.7206072Z ) 2025-05-07T20:32:34.7879555Z self = 2025-05-07T20:32:34.7880679Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.7881238Z 2025-05-07T20:32:34.7881395Z @given( 2025-05-07T20:32:34.7882171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7882797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7883218Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7883559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7883891Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7884173Z ) 2025-05-07T20:32:34.7884531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7884995Z def test_silu_mul_quant( 2025-05-07T20:32:34.7885239Z self, 2025-05-07T20:32:34.7885440Z T: int, 2025-05-07T20:32:34.7885644Z D: int, 2025-05-07T20:32:34.7885860Z scale_ub: Optional[float], 2025-05-07T20:32:34.7886140Z contiguous: bool, 2025-05-07T20:32:34.7886386Z compiled: bool, 2025-05-07T20:32:34.7886611Z ) -> None: 2025-05-07T20:32:34.7886831Z torch.manual_seed(2025) 2025-05-07T20:32:34.7887077Z 2025-05-07T20:32:34.7887356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7887704Z 2025-05-07T20:32:34.7887900Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7888281Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7888597Z x = x_sign * x_clamp 2025-05-07T20:32:34.7888846Z x0 = x[:, :D] 2025-05-07T20:32:34.7889069Z x1 = x[:, D:] 2025-05-07T20:32:34.7889337Z 2025-05-07T20:32:34.7889526Z if contiguous: 2025-05-07T20:32:34.7889761Z x0 = x0.contiguous() 2025-05-07T20:32:34.7890016Z x1 = x1.contiguous() 2025-05-07T20:32:34.7890258Z 2025-05-07T20:32:34.7890450Z if scale_ub is not None: 2025-05-07T20:32:34.7890718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7891131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7891444Z ) 2025-05-07T20:32:34.7891633Z else: 2025-05-07T20:32:34.7891929Z scale_ub_tensor = None 2025-05-07T20:32:34.7892189Z 2025-05-07T20:32:34.7892418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7892740Z op = silu_mul_quant 2025-05-07T20:32:34.7892996Z if compiled: 2025-05-07T20:32:34.7893244Z op = torch.compile(op) 2025-05-07T20:32:34.7893543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7893824Z 2025-05-07T20:32:34.7894018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7894191Z 2025-05-07T20:32:34.7894290Z moe/activation_test.py:117: 2025-05-07T20:32:34.7894587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7894923Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7895203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7895918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7896634Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7897181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7897892Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7898580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7899131Z kernel = self.compile( 2025-05-07T20:32:34.7899685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7900364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7900773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7901012Z 2025-05-07T20:32:34.7901227Z self = 2025-05-07T20:32:34.7902395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7903831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917553420>} 2025-05-07T20:32:34.7905224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7906607Z context = 2025-05-07T20:32:34.7906915Z 2025-05-07T20:32:34.7907098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7907647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7908135Z module_map=module_map) 2025-05-07T20:32:34.7908514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7908956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7909234Z E ^ 2025-05-07T20:32:34.7909718Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7910244Z 2025-05-07T20:32:34.7910683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7911217Z 2025-05-07T20:32:34.7911326Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7911754Z self=, 2025-05-07T20:32:34.7912237Z T=128, 2025-05-07T20:32:34.7912429Z D=7168, 2025-05-07T20:32:34.7912628Z scale_ub=None, 2025-05-07T20:32:34.7912849Z contiguous=True, 2025-05-07T20:32:34.7913074Z compiled=False, 2025-05-07T20:32:34.7913287Z ) 2025-05-07T20:32:34.7913619Z self = 2025-05-07T20:32:34.7914126Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.7914411Z 2025-05-07T20:32:34.7914490Z @given( 2025-05-07T20:32:34.7914726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7915050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7915363Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7915701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7916038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7916327Z ) 2025-05-07T20:32:34.7916688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7917150Z def test_silu_mul_quant( 2025-05-07T20:32:34.7917397Z self, 2025-05-07T20:32:34.7917604Z T: int, 2025-05-07T20:32:34.7917811Z D: int, 2025-05-07T20:32:34.7918034Z scale_ub: Optional[float], 2025-05-07T20:32:34.7918313Z contiguous: bool, 2025-05-07T20:32:34.7918563Z compiled: bool, 2025-05-07T20:32:34.7918798Z ) -> None: 2025-05-07T20:32:34.7919018Z torch.manual_seed(2025) 2025-05-07T20:32:34.7919274Z 2025-05-07T20:32:34.7919558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7919910Z 2025-05-07T20:32:34.7920112Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7920413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7920732Z x = x_sign * x_clamp 2025-05-07T20:32:34.7920994Z x0 = x[:, :D] 2025-05-07T20:32:34.7921225Z x1 = x[:, D:] 2025-05-07T20:32:34.7921440Z 2025-05-07T20:32:34.7921637Z if contiguous: 2025-05-07T20:32:34.7921879Z x0 = x0.contiguous() 2025-05-07T20:32:34.7930952Z x1 = x1.contiguous() 2025-05-07T20:32:34.7931222Z 2025-05-07T20:32:34.7931547Z if scale_ub is not None: 2025-05-07T20:32:34.7931906Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7932260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7932577Z ) 2025-05-07T20:32:34.7932772Z else: 2025-05-07T20:32:34.7932983Z scale_ub_tensor = None 2025-05-07T20:32:34.7933234Z 2025-05-07T20:32:34.7933473Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7933802Z op = silu_mul_quant 2025-05-07T20:32:34.7934058Z if compiled: 2025-05-07T20:32:34.7934312Z op = torch.compile(op) 2025-05-07T20:32:34.7934615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7934890Z 2025-05-07T20:32:34.7935091Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7935264Z 2025-05-07T20:32:34.7935368Z moe/activation_test.py:117: 2025-05-07T20:32:34.7935675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7936018Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7936310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7937176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7937894Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7938451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7939977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7940674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7941272Z kernel = self.compile( 2025-05-07T20:32:34.7941843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7942535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7942948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7943193Z 2025-05-07T20:32:34.7943409Z self = 2025-05-07T20:32:34.7944543Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7945996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172984a0>} 2025-05-07T20:32:34.7947397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7948465Z context = 2025-05-07T20:32:34.7948771Z 2025-05-07T20:32:34.7948945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7949497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7949985Z module_map=module_map) 2025-05-07T20:32:34.7950364Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7950738Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7951010Z E ^ 2025-05-07T20:32:34.7951493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7951969Z 2025-05-07T20:32:34.7952407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7952979Z 2025-05-07T20:32:34.7953105Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7953587Z self=, 2025-05-07T20:32:34.7954004Z T=2048, 2025-05-07T20:32:34.7954198Z D=7168, 2025-05-07T20:32:34.7954400Z scale_ub=1200.0, 2025-05-07T20:32:34.7954623Z contiguous=True, 2025-05-07T20:32:34.7954855Z compiled=False, 2025-05-07T20:32:34.7955067Z ) 2025-05-07T20:32:34.8749263Z self = 2025-05-07T20:32:34.8750009Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.8750380Z 2025-05-07T20:32:34.8750480Z @given( 2025-05-07T20:32:34.8750775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8751098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8751413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8751751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8752086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8752384Z ) 2025-05-07T20:32:34.8752739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8753465Z def test_silu_mul_quant( 2025-05-07T20:32:34.8753724Z self, 2025-05-07T20:32:34.8753923Z T: int, 2025-05-07T20:32:34.8754130Z D: int, 2025-05-07T20:32:34.8754358Z scale_ub: Optional[float], 2025-05-07T20:32:34.8754748Z contiguous: bool, 2025-05-07T20:32:34.8754996Z compiled: bool, 2025-05-07T20:32:34.8755230Z ) -> None: 2025-05-07T20:32:34.8755457Z torch.manual_seed(2025) 2025-05-07T20:32:34.8755716Z 2025-05-07T20:32:34.8755996Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8758252Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
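Two failure modes alternate from here to the end of the job. The CompilationError above is an architecture mismatch: Triton's fp8e4nv type is the hardware FP8 E4M3 format, which requires compute capability 8.9 or newer (Ada/Hopper), while the GPU on this runner only offers the fp8e4b15 and fp8e5 software variants listed in the error, consistent with a pre-Ada part such as an A10G (SM 8.6). A minimal guard that would skip these cases on unsupported hardware, sketched with a hypothetical helper name; the capability query itself is standard PyTorch:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton only lowers fp8e4nv (hardware E4M3)
        # on devices with compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...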
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.8760335Z 2025-05-07T20:32:34.8760462Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.8760694Z 2025-05-07T20:32:34.8760799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8761233Z self=, 2025-05-07T20:32:34.8761654Z T=1, 2025-05-07T20:32:34.8761842Z D=5120, 2025-05-07T20:32:34.8762047Z scale_ub=1200.0, 2025-05-07T20:32:34.8762286Z contiguous=True, 2025-05-07T20:32:34.8762511Z compiled=False, 2025-05-07T20:32:34.8762730Z ) 2025-05-07T20:32:34.8763063Z self = 2025-05-07T20:32:34.8763569Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.8763852Z 2025-05-07T20:32:34.8763934Z @given( 2025-05-07T20:32:34.8764172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8764498Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8764813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8765153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8765492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8765780Z ) 2025-05-07T20:32:34.8766143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8766603Z def test_silu_mul_quant( 2025-05-07T20:32:34.8766850Z self, 2025-05-07T20:32:34.8767052Z T: int, 2025-05-07T20:32:34.8767258Z D: int, 2025-05-07T20:32:34.8767479Z scale_ub: Optional[float], 2025-05-07T20:32:34.8767759Z contiguous: bool, 2025-05-07T20:32:34.8768098Z compiled: bool, 2025-05-07T20:32:34.8768327Z ) -> None: 2025-05-07T20:32:34.8768551Z torch.manual_seed(2025) 2025-05-07T20:32:34.8768800Z 2025-05-07T20:32:34.8769072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8769423Z 2025-05-07T20:32:34.8769617Z x_sign = torch.sign(x) 2025-05-07T20:32:34.8769915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.8770230Z x = x_sign * x_clamp 2025-05-07T20:32:34.8770474Z x0 = x[:, :D] 2025-05-07T20:32:34.8770693Z x1 = x[:, D:] 2025-05-07T20:32:34.8770898Z 2025-05-07T20:32:34.8771086Z if contiguous: 2025-05-07T20:32:34.8771323Z x0 = x0.contiguous() 2025-05-07T20:32:34.8771582Z x1 = x1.contiguous() 2025-05-07T20:32:34.8771904Z 2025-05-07T20:32:34.8772101Z if scale_ub is not None: 2025-05-07T20:32:34.8772372Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.8772721Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.8773057Z ) 2025-05-07T20:32:34.8773273Z else: 2025-05-07T20:32:34.8773541Z scale_ub_tensor = None 2025-05-07T20:32:34.8773799Z 2025-05-07T20:32:34.8774028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.8774351Z op = silu_mul_quant 2025-05-07T20:32:34.8774647Z if compiled: 2025-05-07T20:32:34.8774897Z op = torch.compile(op) 2025-05-07T20:32:34.8775196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8775479Z 2025-05-07T20:32:34.8775677Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.8775845Z 2025-05-07T20:32:34.8775946Z moe/activation_test.py:117: 2025-05-07T20:32:34.8776299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8776642Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.8776929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8777651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.8778367Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.8778918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.8779614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.8780299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.8780847Z kernel = self.compile( 2025-05-07T20:32:34.8781398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.8782077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.8782487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8782722Z 2025-05-07T20:32:34.8782938Z self = 2025-05-07T20:32:34.8784056Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.8785481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917299a80>} 2025-05-07T20:32:34.8786875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.8787935Z context = 2025-05-07T20:32:34.8788230Z 2025-05-07T20:32:34.8788456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.8788991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.8789476Z module_map=module_map) 2025-05-07T20:32:34.8789849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.8790209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.8790467Z E ^ 2025-05-07T20:32:34.8790945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.8791409Z 2025-05-07T20:32:34.8791848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.8792381Z 2025-05-07T20:32:34.8792493Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8792912Z self=, 2025-05-07T20:32:34.8793327Z T=2048, 2025-05-07T20:32:34.8793518Z D=5120, 2025-05-07T20:32:34.8793708Z scale_ub=None, 2025-05-07T20:32:34.8793927Z contiguous=True, 2025-05-07T20:32:34.8794153Z compiled=False, 2025-05-07T20:32:34.8794400Z ) 2025-05-07T20:32:34.8794732Z self = 2025-05-07T20:32:34.8795245Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.8795566Z 2025-05-07T20:32:34.8795643Z @given( 2025-05-07T20:32:34.8795879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8796198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8796505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8796843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8797220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8797517Z ) 2025-05-07T20:32:34.8797872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8798323Z def test_silu_mul_quant( 2025-05-07T20:32:34.8798573Z self, 2025-05-07T20:32:34.8798768Z T: int, 2025-05-07T20:32:34.8798977Z D: int, 2025-05-07T20:32:34.8799198Z scale_ub: Optional[float], 2025-05-07T20:32:34.8799471Z contiguous: bool, 2025-05-07T20:32:34.8799718Z compiled: bool, 2025-05-07T20:32:34.8799966Z ) -> None: 2025-05-07T20:32:34.8800190Z torch.manual_seed(2025) 2025-05-07T20:32:34.8800437Z 2025-05-07T20:32:34.8800709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8801059Z 2025-05-07T20:32:34.8801254Z > x_sign = torch.sign(x) 2025-05-07T20:32:34.8803345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
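The OutOfMemoryError records carry their own remediation hint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That setting is read when the CUDA caching allocator initializes, so it only takes effect if it is in the environment before the first CUDA allocation; a sketch, assuming a pytest conftest.py at the test root (exporting the variable in the CI job step would work equally well):

    # conftest.py -- must execute before anything touches the GPU.
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after the allocator config on purpose)

Fragmentation is only part of the story here, though: the "22.04 GiB memory in use" figure shows the pool is nearly exhausted before these requests of tens to hundreds of MiB are made, which points at state surviving across generated examples rather than at the allocator (see the note further down).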
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.8805295Z 2025-05-07T20:32:34.8805422Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.8805641Z 2025-05-07T20:32:34.8805746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8806544Z self=, 2025-05-07T20:32:34.8806967Z T=16384, 2025-05-07T20:32:34.8807156Z D=5120, 2025-05-07T20:32:34.8807352Z scale_ub=None, 2025-05-07T20:32:34.8807570Z contiguous=True, 2025-05-07T20:32:34.8807795Z compiled=False, 2025-05-07T20:32:34.8808008Z ) 2025-05-07T20:32:34.9557972Z self = 2025-05-07T20:32:34.9558854Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.9559152Z 2025-05-07T20:32:34.9559235Z @given( 2025-05-07T20:32:34.9559475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9559785Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9560093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9560427Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9560756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9561045Z ) 2025-05-07T20:32:34.9561397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9561847Z def test_silu_mul_quant( 2025-05-07T20:32:34.9562087Z self, 2025-05-07T20:32:34.9562286Z T: int, 2025-05-07T20:32:34.9562505Z D: int, 2025-05-07T20:32:34.9562746Z scale_ub: Optional[float], 2025-05-07T20:32:34.9563018Z contiguous: bool, 2025-05-07T20:32:34.9563264Z compiled: bool, 2025-05-07T20:32:34.9563491Z ) -> None: 2025-05-07T20:32:34.9563707Z torch.manual_seed(2025) 2025-05-07T20:32:34.9563949Z 2025-05-07T20:32:34.9564295Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9566434Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9568531Z 2025-05-07T20:32:34.9568653Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9568876Z 2025-05-07T20:32:34.9568981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9569405Z self=, 2025-05-07T20:32:34.9569816Z T=4096, 2025-05-07T20:32:34.9570010Z D=5120, 2025-05-07T20:32:34.9570206Z scale_ub=None, 2025-05-07T20:32:34.9570417Z contiguous=True, 2025-05-07T20:32:34.9570645Z compiled=False, 2025-05-07T20:32:34.9570857Z ) 2025-05-07T20:32:34.9571180Z self = 2025-05-07T20:32:34.9571689Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.9572034Z 2025-05-07T20:32:34.9572122Z @given( 2025-05-07T20:32:34.9572357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9572675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9572988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9573323Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9573656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9573951Z ) 2025-05-07T20:32:34.9574307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9574757Z def test_silu_mul_quant( 2025-05-07T20:32:34.9575004Z self, 2025-05-07T20:32:34.9575202Z T: int, 2025-05-07T20:32:34.9575398Z D: int, 2025-05-07T20:32:34.9575624Z scale_ub: Optional[float], 2025-05-07T20:32:34.9575903Z contiguous: bool, 2025-05-07T20:32:34.9576143Z compiled: bool, 2025-05-07T20:32:34.9576369Z ) -> None: 2025-05-07T20:32:34.9576586Z torch.manual_seed(2025) 2025-05-07T20:32:34.9576830Z 2025-05-07T20:32:34.9577097Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9579275Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9581208Z 2025-05-07T20:32:34.9581330Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9581546Z 2025-05-07T20:32:34.9581655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9582071Z self=, 2025-05-07T20:32:34.9582482Z T=2048, 2025-05-07T20:32:34.9582670Z D=5120, 2025-05-07T20:32:34.9582864Z scale_ub=None, 2025-05-07T20:32:34.9583077Z contiguous=False, 2025-05-07T20:32:34.9583307Z compiled=False, 2025-05-07T20:32:34.9583512Z ) 2025-05-07T20:32:34.9583831Z self = 2025-05-07T20:32:34.9584347Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.9584627Z 2025-05-07T20:32:34.9584755Z @given( 2025-05-07T20:32:34.9584981Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9585296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9585606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9585977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9586313Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9586602Z ) 2025-05-07T20:32:34.9586959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9587449Z def test_silu_mul_quant( 2025-05-07T20:32:34.9587697Z self, 2025-05-07T20:32:34.9587894Z T: int, 2025-05-07T20:32:34.9588089Z D: int, 2025-05-07T20:32:34.9588313Z scale_ub: Optional[float], 2025-05-07T20:32:34.9588593Z contiguous: bool, 2025-05-07T20:32:34.9588830Z compiled: bool, 2025-05-07T20:32:34.9589057Z ) -> None: 2025-05-07T20:32:34.9589280Z torch.manual_seed(2025) 2025-05-07T20:32:34.9589520Z 2025-05-07T20:32:34.9589795Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9591919Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9593861Z 2025-05-07T20:32:34.9593980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9594197Z 2025-05-07T20:32:34.9594312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9594731Z self=, 2025-05-07T20:32:34.9595143Z T=4096, 2025-05-07T20:32:34.9595337Z D=7168, 2025-05-07T20:32:34.9595524Z scale_ub=None, 2025-05-07T20:32:34.9595743Z contiguous=True, 2025-05-07T20:32:34.9595974Z compiled=True, 2025-05-07T20:32:34.9596173Z ) 2025-05-07T20:32:34.9596495Z self = 2025-05-07T20:32:34.9597004Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.9597281Z 2025-05-07T20:32:34.9597367Z @given( 2025-05-07T20:32:34.9597596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9597914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9598224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9598601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9598940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9599233Z ) 2025-05-07T20:32:34.9599585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9600036Z def test_silu_mul_quant( 2025-05-07T20:32:34.9600279Z self, 2025-05-07T20:32:34.9600470Z T: int, 2025-05-07T20:32:34.9600672Z D: int, 2025-05-07T20:32:34.9600892Z scale_ub: Optional[float], 2025-05-07T20:32:34.9601160Z contiguous: bool, 2025-05-07T20:32:34.9601407Z compiled: bool, 2025-05-07T20:32:34.9601631Z ) -> None: 2025-05-07T20:32:34.9601853Z torch.manual_seed(2025) 2025-05-07T20:32:34.9602090Z 2025-05-07T20:32:34.9602370Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9604546Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9606799Z 2025-05-07T20:32:34.9606926Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9607142Z 2025-05-07T20:32:34.9607245Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9607666Z self=, 2025-05-07T20:32:34.9608156Z T=2048, 2025-05-07T20:32:34.9608356Z D=5120, 2025-05-07T20:32:34.9608554Z scale_ub=1200.0, 2025-05-07T20:32:34.9608793Z contiguous=False, 2025-05-07T20:32:34.9609033Z compiled=False, 2025-05-07T20:32:34.9609245Z ) 2025-05-07T20:32:34.9609604Z self = 2025-05-07T20:32:34.9610188Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.9610513Z 2025-05-07T20:32:34.9610593Z @given( 2025-05-07T20:32:34.9610837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9611187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9611524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9611964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9612303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9612599Z ) 2025-05-07T20:32:34.9612947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9613405Z def test_silu_mul_quant( 2025-05-07T20:32:34.9613656Z self, 2025-05-07T20:32:34.9613848Z T: int, 2025-05-07T20:32:34.9614049Z D: int, 2025-05-07T20:32:34.9614271Z scale_ub: Optional[float], 2025-05-07T20:32:34.9614540Z contiguous: bool, 2025-05-07T20:32:34.9614786Z compiled: bool, 2025-05-07T20:32:34.9615014Z ) -> None: 2025-05-07T20:32:34.9615225Z torch.manual_seed(2025) 2025-05-07T20:32:34.9615468Z 2025-05-07T20:32:34.9615742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9617864Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9619872Z 2025-05-07T20:32:34.9619998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9620214Z 2025-05-07T20:32:34.9620317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9620739Z self=, 2025-05-07T20:32:34.9621154Z T=4096, 2025-05-07T20:32:34.9621336Z D=7168, 2025-05-07T20:32:34.9621527Z scale_ub=1200.0, 2025-05-07T20:32:34.9621754Z contiguous=True, 2025-05-07T20:32:34.9621973Z compiled=False, 2025-05-07T20:32:34.9622179Z ) 2025-05-07T20:32:35.0687492Z self = 2025-05-07T20:32:35.0688276Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.0688674Z 2025-05-07T20:32:35.0688754Z @given( 2025-05-07T20:32:35.0689007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0689335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0689649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0689984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0690551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0690840Z ) 2025-05-07T20:32:35.0691203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0691663Z def test_silu_mul_quant( 2025-05-07T20:32:35.0701341Z self, 2025-05-07T20:32:35.0701592Z T: int, 2025-05-07T20:32:35.0701797Z D: int, 2025-05-07T20:32:35.0702026Z scale_ub: Optional[float], 2025-05-07T20:32:35.0702311Z contiguous: bool, 2025-05-07T20:32:35.0702557Z compiled: bool, 2025-05-07T20:32:35.0702795Z ) -> None: 2025-05-07T20:32:35.0703177Z torch.manual_seed(2025) 2025-05-07T20:32:35.0703422Z 2025-05-07T20:32:35.0703706Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0705865Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0708091Z 2025-05-07T20:32:35.0708225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0708447Z 2025-05-07T20:32:35.0708560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0708986Z self=, 2025-05-07T20:32:35.0709413Z T=16384, 2025-05-07T20:32:35.0709616Z D=7168, 2025-05-07T20:32:35.0709815Z scale_ub=None, 2025-05-07T20:32:35.0710044Z contiguous=False, 2025-05-07T20:32:35.0710283Z compiled=True, 2025-05-07T20:32:35.0710494Z ) 2025-05-07T20:32:35.0710823Z self = 2025-05-07T20:32:35.0711347Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.0711634Z 2025-05-07T20:32:35.0711715Z @given( 2025-05-07T20:32:35.0711953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0712280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0712599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0712934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0713271Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0713570Z ) 2025-05-07T20:32:35.0713926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0714385Z def test_silu_mul_quant( 2025-05-07T20:32:35.0714637Z self, 2025-05-07T20:32:35.0714922Z T: int, 2025-05-07T20:32:35.0715131Z D: int, 2025-05-07T20:32:35.0715369Z scale_ub: Optional[float], 2025-05-07T20:32:35.0715663Z contiguous: bool, 2025-05-07T20:32:35.0715929Z compiled: bool, 2025-05-07T20:32:35.0716169Z ) -> None: 2025-05-07T20:32:35.0716398Z torch.manual_seed(2025) 2025-05-07T20:32:35.0716668Z 2025-05-07T20:32:35.0716966Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0719569Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0721952Z 2025-05-07T20:32:35.0722088Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0722400Z 2025-05-07T20:32:35.0722509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0722938Z self=, 2025-05-07T20:32:35.0723358Z T=4096, 2025-05-07T20:32:35.0723612Z D=7168, 2025-05-07T20:32:35.0723815Z scale_ub=None, 2025-05-07T20:32:35.0724046Z contiguous=True, 2025-05-07T20:32:35.0724276Z compiled=False, 2025-05-07T20:32:35.0724486Z ) 2025-05-07T20:32:35.0724807Z self = 2025-05-07T20:32:35.0725318Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.0725668Z 2025-05-07T20:32:35.0725751Z @given( 2025-05-07T20:32:35.0725991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0726319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0726631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0726975Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0727319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0727609Z ) 2025-05-07T20:32:35.0727971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0728436Z def test_silu_mul_quant( 2025-05-07T20:32:35.0728692Z self, 2025-05-07T20:32:35.0728891Z T: int, 2025-05-07T20:32:35.0729095Z D: int, 2025-05-07T20:32:35.0729321Z scale_ub: Optional[float], 2025-05-07T20:32:35.0729595Z contiguous: bool, 2025-05-07T20:32:35.0729844Z compiled: bool, 2025-05-07T20:32:35.0730076Z ) -> None: 2025-05-07T20:32:35.0730294Z torch.manual_seed(2025) 2025-05-07T20:32:35.0730547Z 2025-05-07T20:32:35.0730824Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0733037Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0734979Z 2025-05-07T20:32:35.0735100Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0735324Z 2025-05-07T20:32:35.0735432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0735861Z self=, 2025-05-07T20:32:35.0736279Z T=16384, 2025-05-07T20:32:35.0736476Z D=7168, 2025-05-07T20:32:35.0736730Z scale_ub=None, 2025-05-07T20:32:35.0736955Z contiguous=True, 2025-05-07T20:32:35.0737180Z compiled=False, 2025-05-07T20:32:35.0737390Z ) 2025-05-07T20:32:35.0737718Z self = 2025-05-07T20:32:35.0738226Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.0738522Z 2025-05-07T20:32:35.0738605Z @given( 2025-05-07T20:32:35.0738847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0739161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0739479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0739818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0740156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0740446Z ) 2025-05-07T20:32:35.0740811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0741278Z def test_silu_mul_quant( 2025-05-07T20:32:35.0741527Z self, 2025-05-07T20:32:35.0741733Z T: int, 2025-05-07T20:32:35.0741940Z D: int, 2025-05-07T20:32:35.0742583Z scale_ub: Optional[float], 2025-05-07T20:32:35.0742874Z contiguous: bool, 2025-05-07T20:32:35.0743124Z compiled: bool, 2025-05-07T20:32:35.0743351Z ) -> None: 2025-05-07T20:32:35.0743577Z torch.manual_seed(2025) 2025-05-07T20:32:35.0743880Z 2025-05-07T20:32:35.0744155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0746283Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0748272Z 2025-05-07T20:32:35.0748399Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0748622Z 2025-05-07T20:32:35.0748728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0749157Z self=, 2025-05-07T20:32:35.0749572Z T=16384, 2025-05-07T20:32:35.0749775Z D=7168, 2025-05-07T20:32:35.0749974Z scale_ub=1200.0, 2025-05-07T20:32:35.0750198Z contiguous=True, 2025-05-07T20:32:35.0750429Z compiled=False, 2025-05-07T20:32:35.0750641Z ) 2025-05-07T20:32:35.0750966Z self = 2025-05-07T20:32:35.0751485Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.0751780Z 2025-05-07T20:32:35.0751863Z @given( 2025-05-07T20:32:35.0752100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0752446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0752793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0753133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0753466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0753759Z ) 2025-05-07T20:32:35.0754128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0754586Z def test_silu_mul_quant( 2025-05-07T20:32:35.0754834Z self, 2025-05-07T20:32:35.0755037Z T: int, 2025-05-07T20:32:35.0755243Z D: int, 2025-05-07T20:32:35.0755463Z scale_ub: Optional[float], 2025-05-07T20:32:35.0755743Z contiguous: bool, 2025-05-07T20:32:35.0755994Z compiled: bool, 2025-05-07T20:32:35.0756218Z ) -> None: 2025-05-07T20:32:35.0756443Z torch.manual_seed(2025) 2025-05-07T20:32:35.0756695Z 2025-05-07T20:32:35.0757019Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0759158Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0761094Z 2025-05-07T20:32:35.0761215Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0761440Z 2025-05-07T20:32:35.0761545Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0761975Z self=, 2025-05-07T20:32:35.0762390Z T=128, 2025-05-07T20:32:35.0762587Z D=5120, 2025-05-07T20:32:35.0762786Z scale_ub=1200.0, 2025-05-07T20:32:35.0763012Z contiguous=False, 2025-05-07T20:32:35.0763294Z compiled=False, 2025-05-07T20:32:35.0763512Z ) 2025-05-07T20:32:35.2037543Z self = 2025-05-07T20:32:35.2038961Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.2039956Z 2025-05-07T20:32:35.2040118Z @given( 2025-05-07T20:32:35.2040582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2041213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2041870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2042566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2042899Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2043196Z ) 2025-05-07T20:32:35.2043556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2044013Z def test_silu_mul_quant( 2025-05-07T20:32:35.2044268Z self, 2025-05-07T20:32:35.2044469Z T: int, 2025-05-07T20:32:35.2044670Z D: int, 2025-05-07T20:32:35.2044894Z scale_ub: Optional[float], 2025-05-07T20:32:35.2045170Z contiguous: bool, 2025-05-07T20:32:35.2045408Z compiled: bool, 2025-05-07T20:32:35.2045647Z ) -> None: 2025-05-07T20:32:35.2045869Z torch.manual_seed(2025) 2025-05-07T20:32:35.2046112Z 2025-05-07T20:32:35.2046394Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2046742Z 2025-05-07T20:32:35.2046936Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2047233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2047555Z x = x_sign * x_clamp 2025-05-07T20:32:35.2047795Z x0 = x[:, :D] 2025-05-07T20:32:35.2048022Z x1 = x[:, D:] 2025-05-07T20:32:35.2048237Z 2025-05-07T20:32:35.2048428Z if contiguous: 2025-05-07T20:32:35.2048670Z x0 = x0.contiguous() 2025-05-07T20:32:35.2048934Z x1 = x1.contiguous() 2025-05-07T20:32:35.2049174Z 2025-05-07T20:32:35.2049371Z if scale_ub is not None: 2025-05-07T20:32:35.2049648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2049985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2050316Z ) 2025-05-07T20:32:35.2050520Z else: 2025-05-07T20:32:35.2050737Z scale_ub_tensor = None 2025-05-07T20:32:35.2050989Z 2025-05-07T20:32:35.2051228Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2051549Z op = silu_mul_quant 2025-05-07T20:32:35.2051805Z if compiled: 2025-05-07T20:32:35.2052168Z op = torch.compile(op) 2025-05-07T20:32:35.2052478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2052751Z 2025-05-07T20:32:35.2053072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2053266Z 2025-05-07T20:32:35.2053378Z moe/activation_test.py:117: 2025-05-07T20:32:35.2053682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2054031Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2054322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2055046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2055760Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2056318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2057027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2057711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2058265Z kernel = self.compile( 2025-05-07T20:32:35.2058826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2059583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2059989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2060229Z 2025-05-07T20:32:35.2060483Z self = 2025-05-07T20:32:35.2061605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2063087Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e07c0>} 2025-05-07T20:32:35.2064473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2065533Z context = 2025-05-07T20:32:35.2065838Z 2025-05-07T20:32:35.2066011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2066555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2067030Z module_map=module_map) 2025-05-07T20:32:35.2067404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2067771Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2068039Z E ^ 2025-05-07T20:32:35.2068515Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2068985Z 2025-05-07T20:32:35.2069418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2069947Z 2025-05-07T20:32:35.2070063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2070488Z self=, 2025-05-07T20:32:35.2070898Z T=2048, 2025-05-07T20:32:35.2071093Z D=7168, 2025-05-07T20:32:35.2071292Z scale_ub=None, 2025-05-07T20:32:35.2071509Z contiguous=False, 2025-05-07T20:32:35.2071741Z compiled=False, 2025-05-07T20:32:35.2071953Z ) 2025-05-07T20:32:35.2072275Z self = 2025-05-07T20:32:35.2072792Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.2073077Z 2025-05-07T20:32:35.2073163Z @given( 2025-05-07T20:32:35.2073395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2073717Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2074081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2074423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2074757Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2075050Z ) 2025-05-07T20:32:35.2075410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2075861Z def test_silu_mul_quant( 2025-05-07T20:32:35.2076112Z self, 2025-05-07T20:32:35.2076312Z T: int, 2025-05-07T20:32:35.2076504Z D: int, 2025-05-07T20:32:35.2076731Z scale_ub: Optional[float], 2025-05-07T20:32:35.2077010Z contiguous: bool, 2025-05-07T20:32:35.2077252Z compiled: bool, 2025-05-07T20:32:35.2077489Z ) -> None: 2025-05-07T20:32:35.2077710Z torch.manual_seed(2025) 2025-05-07T20:32:35.2077949Z 2025-05-07T20:32:35.2078226Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2080421Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
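For context on what the test body repeated in these traces exercises: silu_mul_quant is a fused kernel, and judging by the test name and the reference path visible in the failure summary below (a ref_fn calling triton_quantize_fp8_row on the product), the unfused math is a SiLU-gated multiply followed by row-wise fp8 quantization. An eager-mode sketch of just the activation half, as an illustration only:

    import torch
    import torch.nn.functional as F

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Unfused reference: silu(x0) * x1. The fused fbgemm kernel computes
        # this product and quantizes it to fp8 with one scale per row.
        return F.silu(x0) * x1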
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2082399Z 2025-05-07T20:32:35.2082520Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.2082736Z 2025-05-07T20:32:35.2082846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2083317Z self=, 2025-05-07T20:32:35.2083731Z T=128, 2025-05-07T20:32:35.2083927Z D=7168, 2025-05-07T20:32:35.2084114Z scale_ub=1200.0, 2025-05-07T20:32:35.2084344Z contiguous=True, 2025-05-07T20:32:35.2084578Z compiled=True, 2025-05-07T20:32:35.2084779Z ) 2025-05-07T20:32:35.2393314Z self = 2025-05-07T20:32:35.2394782Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2395547Z 2025-05-07T20:32:35.2395749Z @given( 2025-05-07T20:32:35.2396362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2397205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2397993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2398759Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2399401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2399971Z ) 2025-05-07T20:32:35.2400665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2401538Z def test_silu_mul_quant( 2025-05-07T20:32:35.2402021Z self, 2025-05-07T20:32:35.2402395Z T: int, 2025-05-07T20:32:35.2402775Z D: int, 2025-05-07T20:32:35.2403059Z scale_ub: Optional[float], 2025-05-07T20:32:35.2403377Z contiguous: bool, 2025-05-07T20:32:35.2403619Z compiled: bool, 2025-05-07T20:32:35.2403841Z ) -> None: 2025-05-07T20:32:35.2404056Z torch.manual_seed(2025) 2025-05-07T20:32:35.2404304Z 2025-05-07T20:32:35.2404574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2404921Z 2025-05-07T20:32:35.2405116Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2405403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2405719Z x = x_sign * x_clamp 2025-05-07T20:32:35.2405962Z x0 = x[:, :D] 2025-05-07T20:32:35.2406463Z x1 = x[:, D:] 2025-05-07T20:32:35.2406676Z 2025-05-07T20:32:35.2406865Z if contiguous: 2025-05-07T20:32:35.2407091Z x0 = x0.contiguous() 2025-05-07T20:32:35.2407559Z x1 = x1.contiguous() 2025-05-07T20:32:35.2407804Z 2025-05-07T20:32:35.2407994Z if scale_ub is not None: 2025-05-07T20:32:35.2408273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2408611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2408926Z ) 2025-05-07T20:32:35.2409119Z else: 2025-05-07T20:32:35.2409337Z scale_ub_tensor = None 2025-05-07T20:32:35.2409590Z 2025-05-07T20:32:35.2409824Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2410146Z op = silu_mul_quant 2025-05-07T20:32:35.2410402Z if compiled: 2025-05-07T20:32:35.2410647Z op = torch.compile(op) 2025-05-07T20:32:35.2410955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2411239Z 2025-05-07T20:32:35.2411431Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2411604Z 2025-05-07T20:32:35.2411704Z moe/activation_test.py:117: 2025-05-07T20:32:35.2412092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2412430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2412789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2413368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.2413944Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.2414686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2415400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2415953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2416733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2417420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2417970Z kernel = self.compile( 2025-05-07T20:32:35.2418530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2419199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2419612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2419857Z 2025-05-07T20:32:35.2420067Z self = 2025-05-07T20:32:35.2421186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2422619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e1940>} 2025-05-07T20:32:35.2424010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2425073Z context = 2025-05-07T20:32:35.2425370Z 2025-05-07T20:32:35.2425550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2426091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2426568Z module_map=module_map) 2025-05-07T20:32:35.2426943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2427315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2427577Z E ^ 2025-05-07T20:32:35.2428150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2428617Z 2025-05-07T20:32:35.2429057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2429587Z 2025-05-07T20:32:35.2429698Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2430123Z self=, 2025-05-07T20:32:35.2430547Z T=128, 2025-05-07T20:32:35.2430746Z D=7168, 2025-05-07T20:32:35.2430942Z scale_ub=1200.0, 2025-05-07T20:32:35.2431181Z contiguous=True, 2025-05-07T20:32:35.2431414Z compiled=False, 2025-05-07T20:32:35.2431624Z ) 2025-05-07T20:32:35.2431953Z self = 2025-05-07T20:32:35.2432467Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.2432745Z 2025-05-07T20:32:35.2432833Z @given( 2025-05-07T20:32:35.2433063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2433385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2433702Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2434084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2434425Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2434719Z ) 2025-05-07T20:32:35.2435072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2435592Z def test_silu_mul_quant( 2025-05-07T20:32:35.2435847Z self, 2025-05-07T20:32:35.2436075Z T: int, 2025-05-07T20:32:35.2436273Z D: int, 2025-05-07T20:32:35.2436498Z scale_ub: Optional[float], 2025-05-07T20:32:35.2436775Z contiguous: bool, 2025-05-07T20:32:35.2437063Z compiled: bool, 2025-05-07T20:32:35.2437296Z ) -> None: 2025-05-07T20:32:35.2437516Z torch.manual_seed(2025) 2025-05-07T20:32:35.2437758Z 2025-05-07T20:32:35.2438038Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2438390Z 2025-05-07T20:32:35.2438584Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2438885Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2440983Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
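By this example only 4.44 MiB of the 22.07 GiB card remains free, so even the 20 MiB temporary for torch.clamp fails: allocations are accumulating across generated examples and starving the later ones. For unittest.TestCase tests, Hypothesis runs setUp and tearDown around every generated example, which gives a natural place to release the pool between examples; a hedged sketch (gc plus empty_cache is a blunt instrument, but it rules lingering caches and fragmentation out):

    import gc
    import unittest
    import torch

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Runs after each Hypothesis example for TestCase-style tests:
            # drop dead references, then hand cached CUDA blocks back so the
            # next example starts from an empty pool.
            gc.collect()
            torch.cuda.empty_cache()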
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2442965Z 2025-05-07T20:32:35.2443114Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.2443333Z 2025-05-07T20:32:35.2443447Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2443869Z self=, 2025-05-07T20:32:35.2444305Z T=128, 2025-05-07T20:32:35.2444506Z D=5120, 2025-05-07T20:32:35.2444703Z scale_ub=1200.0, 2025-05-07T20:32:35.2444934Z contiguous=True, 2025-05-07T20:32:35.2445166Z compiled=True, 2025-05-07T20:32:35.2445368Z ) 2025-05-07T20:32:35.2445700Z self = 2025-05-07T20:32:35.2446215Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2446493Z 2025-05-07T20:32:35.2446580Z @given( 2025-05-07T20:32:35.2446815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2456710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2457038Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2457371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2457806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2458098Z ) 2025-05-07T20:32:35.2458452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2458911Z def test_silu_mul_quant( 2025-05-07T20:32:35.2459161Z self, 2025-05-07T20:32:35.2459365Z T: int, 2025-05-07T20:32:35.2459562Z D: int, 2025-05-07T20:32:35.2459789Z scale_ub: Optional[float], 2025-05-07T20:32:35.2460074Z contiguous: bool, 2025-05-07T20:32:35.2460319Z compiled: bool, 2025-05-07T20:32:35.2460550Z ) -> None: 2025-05-07T20:32:35.2460776Z torch.manual_seed(2025) 2025-05-07T20:32:35.2461018Z 2025-05-07T20:32:35.2461301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2461659Z 2025-05-07T20:32:35.2461852Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2462149Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2464281Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2466728Z 2025-05-07T20:32:35.2466857Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.2467074Z 2025-05-07T20:32:35.2467184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2467600Z self=, 2025-05-07T20:32:35.2468065Z T=128, 2025-05-07T20:32:35.2468258Z D=7168, 2025-05-07T20:32:35.2468450Z scale_ub=None, 2025-05-07T20:32:35.2468670Z contiguous=True, 2025-05-07T20:32:35.2468900Z compiled=True, 2025-05-07T20:32:35.2469103Z ) 2025-05-07T20:32:35.4946987Z self = 2025-05-07T20:32:35.4947551Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.4947832Z 2025-05-07T20:32:35.4947913Z @given( 2025-05-07T20:32:35.4948153Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.4948475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.4948791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.4949134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.4949472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.4949769Z ) 2025-05-07T20:32:35.4950138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.4950596Z def test_silu_mul_quant( 2025-05-07T20:32:35.4950852Z self, 2025-05-07T20:32:35.4951064Z T: int, 2025-05-07T20:32:35.4951274Z D: int, 2025-05-07T20:32:35.4951498Z scale_ub: Optional[float], 2025-05-07T20:32:35.4951784Z contiguous: bool, 2025-05-07T20:32:35.4952040Z compiled: bool, 2025-05-07T20:32:35.4952274Z ) -> None: 2025-05-07T20:32:35.4952500Z torch.manual_seed(2025) 2025-05-07T20:32:35.4952782Z 2025-05-07T20:32:35.4953079Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.4955486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.4957465Z 2025-05-07T20:32:35.4973051Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.4973371Z 2025-05-07T20:32:35.4973491Z FAILED 2025-05-07T20:32:35.4973648Z 2025-05-07T20:32:35.4973831Z =================================== FAILURES =================================== 2025-05-07T20:32:35.4974448Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:35.4975102Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:35.4975996Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:35.4976791Z | yield 2025-05-07T20:32:35.4977414Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:35.4978182Z | self._callTestMethod(testMethod) 2025-05-07T20:32:35.4979005Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:35.4979825Z | if method() is not None: 2025-05-07T20:32:35.4980183Z | ^^^^^^^^ 2025-05-07T20:32:35.4981453Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:35.4982568Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.4983137Z | ^^^^^^^ 2025-05-07T20:32:35.4983966Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:35.4984906Z | raise the_error_hypothesis_found 2025-05-07T20:32:35.4985527Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:35.4986259Z +-+---------------- 1 ---------------- 2025-05-07T20:32:35.4986676Z | Traceback (most recent call last): 2025-05-07T20:32:35.4987739Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.4988911Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.4989454Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.4992499Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.4995562Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.4996208Z | self=, 2025-05-07T20:32:35.4996806Z | T=2048, 2025-05-07T20:32:35.4997141Z | D=5120, # or any other generated value 2025-05-07T20:32:35.4997628Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.4998142Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.4998675Z | compiled=False, # or any other generated value 2025-05-07T20:32:35.4999111Z | ) 2025-05-07T20:32:35.4999363Z | 2025-05-07T20:32:35.5000135Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:35.5001050Z +---------------- 2 ---------------- 2025-05-07T20:32:35.5001470Z | Traceback (most recent call last): 2025-05-07T20:32:35.5002606Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.5003784Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5004329Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.5007614Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.5010157Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.5010609Z | self=, 2025-05-07T20:32:35.5011030Z | T=128, 2025-05-07T20:32:35.5011232Z | D=7168, 2025-05-07T20:32:35.5011436Z | scale_ub=None, 2025-05-07T20:32:35.5011779Z | contiguous=True, 2025-05-07T20:32:35.5012127Z | compiled=True, 2025-05-07T20:32:35.5012366Z | ) 2025-05-07T20:32:35.5012553Z | 2025-05-07T20:32:35.5013097Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.5013789Z +---------------- 3 ---------------- 2025-05-07T20:32:35.5014086Z | Traceback (most recent call last): 2025-05-07T20:32:35.5014818Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.5015684Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5016075Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.5018143Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
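Each "You can reproduce this example" line above pins an exact replay: the first argument is the Hypothesis version that produced the blob, the second encodes the choices for one falsifying example. Temporarily applied to this test it would look roughly like the following, with the blob copied from failure 1 and the decorators and body otherwise exactly as in activation_test.py (_MAX_SAMPLES is defined in that file):

    import unittest
    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    class ActivationTests(unittest.TestCase):
        @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # temporary; remove after debugging
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
        def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
            ...  # body unchanged from the test file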
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
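Hypothesis prints one version-pinned @reproduce_failure blob per distinct failure. A sketch of replaying sub-failure 3 locally with the blob above; the blob is only valid against hypothesis 6.131.14 and this exact strategy list, and the real test's @settings line (verbosity, _MAX_SAMPLES) is omitted here for self-containment.

from typing import Optional

from hypothesis import given, reproduce_failure, strategies as st

# Hedged sketch: temporarily pin the falsifying example reported above.
@reproduce_failure("6.131.14", b"AEEBQQBBAUEAQQA=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant_replay(
    T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
) -> None:
    # Stand-in body; in practice this would be the unchanged
    # test_silu_mul_quant body from the listing printed below.
    ...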
  +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
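Sub-failures 1-3 are out-of-memory; sub-failure 4 is the Triton error. Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which the GPU on this runner (a g5 / A10G, sm_86) does not support, hence only 'fp8e4b15' and 'fp8e5' compile. A sketch of a preflight check: fp8_e4m3_supported is a hypothetical helper, and the (8, 9) cutoff (sm_89, Ada) is an assumption, not something the log states.

import torch

def fp8_e4m3_supported() -> bool:
    # Hypothetical guard: assume NVIDIA e4m3 (Triton fp8e4nv) support begins
    # at sm_89; the A10G here reports sm_86, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A check like this, run before launching _kernel_quantize_fp8_row or _fbgemm_silu_mul_quant, would surface a clear skip or fallback instead of a CompilationError from deep inside make_ir.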
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef86dc60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
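The two pieces this test exercises are simple enough to emulate off-GPU: y = silu(x0) * x1, then row-wise fp8 quantization with an optional scale upper bound. The helper below is an illustrative stand-in for triton_quantize_fp8_row's row-wise scheme (per-row scale against the e4m3 finite max of 448); it is not FBGEMM's actual kernel, and the exact handling of scale_ub there may differ.

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def rowwise_fp8_quant_ref(
    y: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Illustrative: one scale per row, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX                  # per-row dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

# CPU round trip mirroring ref_fn above (no Triton, no CUDA needed):
x0, x1 = torch.randn(4, 8), torch.randn(4, 8)
y = x0 * torch.sigmoid(x0) * x1                  # silu(x0) * x1
y_fp8, y_scale = rowwise_fp8_quant_ref(y)
y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]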
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
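Every remaining example fails the same way: the eager/compiled path dies compiling _fbgemm_silu_mul_quant and the reference path dies compiling _kernel_quantize_fp8_row, both on the fp8e4nv ValueError. One way a suite could avoid burning CI time on such runners is a class-level skip; a sketch using stdlib unittest, reusing the (8, 9) capability assumption from above:

import unittest

import torch

def _fp8_gpu_available() -> bool:
    # Same assumed sm_89 cutoff as the guard sketched earlier.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _fp8_gpu_available(), "GPU lacks fp8e4nv (float8_e4m3fn) support")
class ActivationTests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # existing Hypothesis-driven body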
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
    [test body as above]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
    [test body as above]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
    [test body as above]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5413381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5413614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5413975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5414071Z kernel = self.compile( 2025-05-07T20:32:35.5414473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5414655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5414837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5414841Z 2025-05-07T20:32:35.5415061Z self = 2025-05-07T20:32:35.5415871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5416400Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5f9c0>} 2025-05-07T20:32:35.5417181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5417387Z context = 2025-05-07T20:32:35.5417391Z 2025-05-07T20:32:35.5417572Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5417913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5418027Z module_map=module_map) 2025-05-07T20:32:35.5418192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5418331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5418415Z E ^ 2025-05-07T20:32:35.5418782Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5418787Z 2025-05-07T20:32:35.5419216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5419265Z 2025-05-07T20:32:35.5419384Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5419690Z self=, 2025-05-07T20:32:35.5419810Z T=1, 2025-05-07T20:32:35.5419888Z D=5120, 2025-05-07T20:32:35.5419973Z scale_ub=None, 2025-05-07T20:32:35.5420063Z contiguous=True, 2025-05-07T20:32:35.5420150Z compiled=True, 2025-05-07T20:32:35.5420224Z ) 2025-05-07T20:32:35.5420456Z self = 2025-05-07T20:32:35.5420628Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5420636Z 2025-05-07T20:32:35.5420718Z @given( 2025-05-07T20:32:35.5420838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5420937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5421056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5421176Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5421294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5421377Z ) 2025-05-07T20:32:35.5421631Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5421726Z def test_silu_mul_quant( 2025-05-07T20:32:35.5421813Z self, 2025-05-07T20:32:35.5421896Z T: int, 2025-05-07T20:32:35.5421975Z D: int, 2025-05-07T20:32:35.5422079Z scale_ub: Optional[float], 2025-05-07T20:32:35.5422170Z contiguous: bool, 2025-05-07T20:32:35.5422268Z compiled: bool, 2025-05-07T20:32:35.5422350Z ) -> None: 2025-05-07T20:32:35.5422446Z torch.manual_seed(2025) 2025-05-07T20:32:35.5422527Z 2025-05-07T20:32:35.5422698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5422772Z 2025-05-07T20:32:35.5422876Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5423005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5423096Z x = x_sign * x_clamp 2025-05-07T20:32:35.5423182Z x0 = x[:, :D] 2025-05-07T20:32:35.5423265Z x1 = x[:, D:] 2025-05-07T20:32:35.5423338Z 2025-05-07T20:32:35.5423485Z if contiguous: 2025-05-07T20:32:35.5423583Z x0 = x0.contiguous() 2025-05-07T20:32:35.5423679Z x1 = x1.contiguous() 2025-05-07T20:32:35.5423757Z 2025-05-07T20:32:35.5424431Z if scale_ub is not None: 2025-05-07T20:32:35.5424545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5424682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5424762Z ) 2025-05-07T20:32:35.5424848Z else: 2025-05-07T20:32:35.5424946Z scale_ub_tensor = None 2025-05-07T20:32:35.5425019Z 2025-05-07T20:32:35.5425153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5425245Z op = silu_mul_quant 2025-05-07T20:32:35.5425336Z if compiled: 2025-05-07T20:32:35.5425443Z op = torch.compile(op) 2025-05-07T20:32:35.5425550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5425627Z 2025-05-07T20:32:35.5425724Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5425846Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5425927Z 2025-05-07T20:32:35.5426113Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5426217Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5426324Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5426488Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5426632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5426712Z 2025-05-07T20:32:35.5426814Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5426818Z 2025-05-07T20:32:35.5426924Z moe/activation_test.py:126: 2025-05-07T20:32:35.5427096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5427202Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5427346Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5427928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5428033Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5428412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5428646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5429031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5429298Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5429694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5429872Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5430232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5430318Z fn() 2025-05-07T20:32:35.5430738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5430824Z self.fn.run( 2025-05-07T20:32:35.5431180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5431279Z kernel = self.compile( 2025-05-07T20:32:35.5431674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5431860Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5431998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5432002Z 2025-05-07T20:32:35.5432218Z self = 2025-05-07T20:32:35.5433073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5433601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee57c400>} 2025-05-07T20:32:35.5434388Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5434588Z context = 2025-05-07T20:32:35.5434595Z 2025-05-07T20:32:35.5434771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5435045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5435155Z module_map=module_map) 2025-05-07T20:32:35.5435367Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5435479Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5435563Z E ^ 2025-05-07T20:32:35.5435931Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5435975Z 2025-05-07T20:32:35.5436407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5436412Z 2025-05-07T20:32:35.5436525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5436797Z self=, 2025-05-07T20:32:35.5436882Z T=2048, 2025-05-07T20:32:35.5436959Z D=5120, 2025-05-07T20:32:35.5437043Z scale_ub=None, 2025-05-07T20:32:35.5437141Z contiguous=True, 2025-05-07T20:32:35.5437226Z compiled=True, 2025-05-07T20:32:35.5437301Z ) 2025-05-07T20:32:35.5437534Z self = 2025-05-07T20:32:35.5437715Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5437720Z 2025-05-07T20:32:35.5437800Z @given( 2025-05-07T20:32:35.5437930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5438036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5438161Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5438286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5438403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5438491Z ) 2025-05-07T20:32:35.5438745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5438842Z def test_silu_mul_quant( 2025-05-07T20:32:35.5438927Z self, 2025-05-07T20:32:35.5439009Z T: int, 2025-05-07T20:32:35.5439090Z D: int, 2025-05-07T20:32:35.5439196Z scale_ub: Optional[float], 2025-05-07T20:32:35.5439290Z contiguous: bool, 2025-05-07T20:32:35.5439380Z compiled: bool, 2025-05-07T20:32:35.5439468Z ) -> None: 2025-05-07T20:32:35.5439569Z torch.manual_seed(2025) 2025-05-07T20:32:35.5439646Z 2025-05-07T20:32:35.5439835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5439912Z 2025-05-07T20:32:35.5440012Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5440141Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5440234Z x = x_sign * x_clamp 2025-05-07T20:32:35.5440323Z x0 = x[:, :D] 2025-05-07T20:32:35.5440409Z x1 = x[:, D:] 2025-05-07T20:32:35.5440484Z 2025-05-07T20:32:35.5440577Z if contiguous: 2025-05-07T20:32:35.5440671Z x0 = x0.contiguous() 2025-05-07T20:32:35.5440813Z x1 = x1.contiguous() 2025-05-07T20:32:35.5440896Z 2025-05-07T20:32:35.5440990Z if scale_ub is not None: 2025-05-07T20:32:35.5441098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5441243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5441327Z ) 2025-05-07T20:32:35.5441416Z else: 2025-05-07T20:32:35.5441516Z scale_ub_tensor = None 2025-05-07T20:32:35.5441591Z 2025-05-07T20:32:35.5441728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5441821Z op = silu_mul_quant 2025-05-07T20:32:35.5441908Z if compiled: 2025-05-07T20:32:35.5442012Z op = torch.compile(op) 2025-05-07T20:32:35.5442118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5442196Z 2025-05-07T20:32:35.5442293Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5442417Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5442490Z 2025-05-07T20:32:35.5442637Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5442740Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5442894Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5443022Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5443167Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5443290Z 2025-05-07T20:32:35.5443393Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5443397Z 2025-05-07T20:32:35.5443497Z moe/activation_test.py:126: 2025-05-07T20:32:35.5443636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5443744Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5443927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5444507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5444610Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5444987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5445216Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5445593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5445864Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5446252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5446435Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5446791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5446868Z fn() 2025-05-07T20:32:35.5447290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5447378Z self.fn.run( 2025-05-07T20:32:35.5447727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5447830Z kernel = self.compile( 2025-05-07T20:32:35.5448226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5448412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5448544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5448552Z 2025-05-07T20:32:35.5448762Z self = 2025-05-07T20:32:35.5449647Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5450169Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9d07ba0>} 2025-05-07T20:32:35.5450946Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5451144Z context = 2025-05-07T20:32:35.5451149Z 2025-05-07T20:32:35.5451323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5451599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5451708Z module_map=module_map) 2025-05-07T20:32:35.5451941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5452048Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5452127Z E ^ 2025-05-07T20:32:35.5452547Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5452552Z 2025-05-07T20:32:35.5452982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5453023Z 2025-05-07T20:32:35.5453136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5453366Z self=, 2025-05-07T20:32:35.5453444Z T=128, 2025-05-07T20:32:35.5453565Z D=5120, 2025-05-07T20:32:35.5453650Z scale_ub=None, 2025-05-07T20:32:35.5453735Z contiguous=True, 2025-05-07T20:32:35.5453827Z compiled=True, 2025-05-07T20:32:35.5453901Z ) 2025-05-07T20:32:35.5454130Z self = 2025-05-07T20:32:35.5454309Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5454313Z 2025-05-07T20:32:35.5454398Z @given( 2025-05-07T20:32:35.5454523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5454623Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5454738Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5454864Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5454978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5455054Z ) 2025-05-07T20:32:35.5455309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5455405Z def test_silu_mul_quant( 2025-05-07T20:32:35.5455489Z self, 2025-05-07T20:32:35.5455565Z T: int, 2025-05-07T20:32:35.5455643Z D: int, 2025-05-07T20:32:35.5455747Z scale_ub: Optional[float], 2025-05-07T20:32:35.5455838Z contiguous: bool, 2025-05-07T20:32:35.5455924Z compiled: bool, 2025-05-07T20:32:35.5456008Z ) -> None: 2025-05-07T20:32:35.5456110Z torch.manual_seed(2025) 2025-05-07T20:32:35.5456187Z 2025-05-07T20:32:35.5456364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5456439Z 2025-05-07T20:32:35.5456534Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5456667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5456756Z x = x_sign * x_clamp 2025-05-07T20:32:35.5456838Z x0 = x[:, :D] 2025-05-07T20:32:35.5456924Z x1 = x[:, D:] 2025-05-07T20:32:35.5456999Z 2025-05-07T20:32:35.5457089Z if contiguous: 2025-05-07T20:32:35.5457184Z x0 = x0.contiguous() 2025-05-07T20:32:35.5457276Z x1 = x1.contiguous() 2025-05-07T20:32:35.5457353Z 2025-05-07T20:32:35.5457444Z if scale_ub is not None: 2025-05-07T20:32:35.5457597Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5457744Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5457820Z ) 2025-05-07T20:32:35.5457899Z else: 2025-05-07T20:32:35.5457998Z scale_ub_tensor = None 2025-05-07T20:32:35.5458071Z 2025-05-07T20:32:35.5458200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5458302Z op = silu_mul_quant 2025-05-07T20:32:35.5458388Z if compiled: 2025-05-07T20:32:35.5458494Z op = torch.compile(op) 2025-05-07T20:32:35.5458600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5458672Z 2025-05-07T20:32:35.5458772Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5458897Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5458970Z 2025-05-07T20:32:35.5459112Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5459215Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5459317Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5459445Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5459632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5459710Z 2025-05-07T20:32:35.5459811Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5459816Z 2025-05-07T20:32:35.5459954Z moe/activation_test.py:126: 2025-05-07T20:32:35.5460092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5460195Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5460330Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5460914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5461058Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5461438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5461674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5462056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5462329Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5462722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5462895Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5463256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5463337Z fn() 2025-05-07T20:32:35.5463757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5463841Z self.fn.run( 2025-05-07T20:32:35.5464194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5464298Z kernel = self.compile( 2025-05-07T20:32:35.5464695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5464875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5465014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5465019Z 2025-05-07T20:32:35.5465233Z self = 2025-05-07T20:32:35.5466049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5466621Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac90c9300>} 2025-05-07T20:32:35.5467409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5467610Z context = 2025-05-07T20:32:35.5467614Z 2025-05-07T20:32:35.5467785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5468065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5468179Z module_map=module_map) 2025-05-07T20:32:35.5468351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5468458Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5468542Z E ^ 2025-05-07T20:32:35.5468918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5468922Z 2025-05-07T20:32:35.5469398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5469403Z 2025-05-07T20:32:35.5469510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5469786Z self=, 2025-05-07T20:32:35.5469868Z T=4096, 2025-05-07T20:32:35.5469952Z D=5120, 2025-05-07T20:32:35.5470038Z scale_ub=None, 2025-05-07T20:32:35.5470124Z contiguous=True, 2025-05-07T20:32:35.5470215Z compiled=True, 2025-05-07T20:32:35.5470333Z ) 2025-05-07T20:32:35.5470563Z self = 2025-05-07T20:32:35.5470750Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5470758Z 2025-05-07T20:32:35.5476489Z @given( 2025-05-07T20:32:35.5476630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5476745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5476859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5476981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5477095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5477173Z ) 2025-05-07T20:32:35.5477435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5477530Z def test_silu_mul_quant( 2025-05-07T20:32:35.5477605Z self, 2025-05-07T20:32:35.5477687Z T: int, 2025-05-07T20:32:35.5477767Z D: int, 2025-05-07T20:32:35.5477864Z scale_ub: Optional[float], 2025-05-07T20:32:35.5477958Z contiguous: bool, 2025-05-07T20:32:35.5478044Z compiled: bool, 2025-05-07T20:32:35.5478128Z ) -> None: 2025-05-07T20:32:35.5478224Z torch.manual_seed(2025) 2025-05-07T20:32:35.5478298Z 2025-05-07T20:32:35.5478477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5478555Z 2025-05-07T20:32:35.5478650Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5478783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5478872Z x = x_sign * x_clamp 2025-05-07T20:32:35.5478956Z x0 = x[:, :D] 2025-05-07T20:32:35.5479042Z x1 = x[:, D:] 2025-05-07T20:32:35.5479117Z 2025-05-07T20:32:35.5479201Z if contiguous: 2025-05-07T20:32:35.5479297Z x0 = x0.contiguous() 2025-05-07T20:32:35.5479387Z x1 = x1.contiguous() 2025-05-07T20:32:35.5479461Z 2025-05-07T20:32:35.5479563Z if scale_ub is not None: 2025-05-07T20:32:35.5479671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5479813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5479890Z ) 2025-05-07T20:32:35.5480036Z else: 2025-05-07T20:32:35.5480135Z scale_ub_tensor = None 2025-05-07T20:32:35.5480210Z 2025-05-07T20:32:35.5480346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5480442Z op = silu_mul_quant 2025-05-07T20:32:35.5480529Z if compiled: 2025-05-07T20:32:35.5480630Z op = torch.compile(op) 2025-05-07T20:32:35.5480741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5480816Z 2025-05-07T20:32:35.5480909Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5481034Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5481108Z 2025-05-07T20:32:35.5481251Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5481361Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5481461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5481590Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5481736Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5481810Z 2025-05-07T20:32:35.5481962Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5481968Z 2025-05-07T20:32:35.5482071Z moe/activation_test.py:126: 2025-05-07T20:32:35.5482208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5482359Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5482497Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5483143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5483244Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5483686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5483924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5484310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5484582Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5484975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5485149Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5485507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5485583Z fn() 2025-05-07T20:32:35.5486000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5486089Z self.fn.run( 2025-05-07T20:32:35.5486442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5486543Z kernel = self.compile( 2025-05-07T20:32:35.5486943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5487123Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5487259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5487267Z 2025-05-07T20:32:35.5487479Z self = 2025-05-07T20:32:35.5488298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5488822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac902a340>} 2025-05-07T20:32:35.5489650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5489852Z context = 2025-05-07T20:32:35.5489857Z 2025-05-07T20:32:35.5490024Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5490301Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5490411Z module_map=module_map) 2025-05-07T20:32:35.5490577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5490685Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5490761Z E ^ 2025-05-07T20:32:35.5491132Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5491139Z 2025-05-07T20:32:35.5491565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5491613Z 2025-05-07T20:32:35.5491719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5492011Z self=, 2025-05-07T20:32:35.5492132Z T=16384, 2025-05-07T20:32:35.5492206Z D=5120, 2025-05-07T20:32:35.5492296Z scale_ub=None, 2025-05-07T20:32:35.5492378Z contiguous=True, 2025-05-07T20:32:35.5492466Z compiled=True, 2025-05-07T20:32:35.5492539Z ) 2025-05-07T20:32:35.5492765Z self = 2025-05-07T20:32:35.5492992Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5492997Z 2025-05-07T20:32:35.5493073Z @given( 2025-05-07T20:32:35.5493195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5493311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5493427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5493547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5493666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5493742Z ) 2025-05-07T20:32:35.5494000Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5494099Z def test_silu_mul_quant( 2025-05-07T20:32:35.5494177Z self, 2025-05-07T20:32:35.5494264Z T: int, 2025-05-07T20:32:35.5494345Z D: int, 2025-05-07T20:32:35.5494446Z scale_ub: Optional[float], 2025-05-07T20:32:35.5494538Z contiguous: bool, 2025-05-07T20:32:35.5494627Z compiled: bool, 2025-05-07T20:32:35.5494706Z ) -> None: 2025-05-07T20:32:35.5494803Z torch.manual_seed(2025) 2025-05-07T20:32:35.5494877Z 2025-05-07T20:32:35.5495052Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5495130Z 2025-05-07T20:32:35.5495224Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5495357Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5495449Z x = x_sign * x_clamp 2025-05-07T20:32:35.5495530Z x0 = x[:, :D] 2025-05-07T20:32:35.5495615Z x1 = x[:, D:] 2025-05-07T20:32:35.5495686Z 2025-05-07T20:32:35.5495773Z if contiguous: 2025-05-07T20:32:35.5495870Z x0 = x0.contiguous() 2025-05-07T20:32:35.5495959Z x1 = x1.contiguous() 2025-05-07T20:32:35.5496033Z 2025-05-07T20:32:35.5496131Z if scale_ub is not None: 2025-05-07T20:32:35.5496237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5496378Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5496461Z ) 2025-05-07T20:32:35.5496539Z else: 2025-05-07T20:32:35.5496636Z scale_ub_tensor = None 2025-05-07T20:32:35.5496715Z 2025-05-07T20:32:35.5496892Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5496991Z op = silu_mul_quant 2025-05-07T20:32:35.5497076Z if compiled: 2025-05-07T20:32:35.5497179Z op = torch.compile(op) 2025-05-07T20:32:35.5497288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5497360Z 2025-05-07T20:32:35.5497451Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5497580Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5497655Z 2025-05-07T20:32:35.5497791Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5497900Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5498002Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5498132Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5498273Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5498348Z 2025-05-07T20:32:35.5498454Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.5498458Z 2025-05-07T20:32:35.5498560Z moe/activation_test.py:126: 2025-05-07T20:32:35.5498734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5498846Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5498980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5499600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5499710Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5500078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5500349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5500727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5500989Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5501382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5501551Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5501909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5501992Z fn() 2025-05-07T20:32:35.5502405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5502497Z self.fn.run( 2025-05-07T20:32:35.5502880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5502996Z kernel = self.compile( 2025-05-07T20:32:35.5503393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5503573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5503715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5503719Z 2025-05-07T20:32:35.5503928Z self = 2025-05-07T20:32:35.5504737Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5505261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac87a6d40>} 2025-05-07T20:32:35.5506080Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5506543Z context = 2025-05-07T20:32:35.5506555Z 2025-05-07T20:32:35.5506782Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5507064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5507179Z module_map=module_map) 2025-05-07T20:32:35.5507348Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5507458Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5507537Z E ^ 2025-05-07T20:32:35.5507909Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5507916Z 2025-05-07T20:32:35.5508355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5508360Z 2025-05-07T20:32:35.5508467Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5508793Z self=, 2025-05-07T20:32:35.5508874Z T=1, 2025-05-07T20:32:35.5508951Z D=5120, 2025-05-07T20:32:35.5509043Z scale_ub=1200.0, 2025-05-07T20:32:35.5509129Z contiguous=True, 2025-05-07T20:32:35.5509269Z compiled=True, 2025-05-07T20:32:35.5509348Z ) 2025-05-07T20:32:35.5509575Z self = 2025-05-07T20:32:35.5509746Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5509756Z 2025-05-07T20:32:35.5509836Z @given( 2025-05-07T20:32:35.5510019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5510125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5510244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5510367Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5510490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5510569Z ) 2025-05-07T20:32:35.5510825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5510925Z def test_silu_mul_quant( 2025-05-07T20:32:35.5511003Z self, 2025-05-07T20:32:35.5511085Z T: int, 2025-05-07T20:32:35.5511168Z D: int, 2025-05-07T20:32:35.5511269Z scale_ub: Optional[float], 2025-05-07T20:32:35.5511366Z contiguous: bool, 2025-05-07T20:32:35.5511454Z compiled: bool, 2025-05-07T20:32:35.5511537Z ) -> None: 2025-05-07T20:32:35.5511639Z torch.manual_seed(2025) 2025-05-07T20:32:35.5511718Z 2025-05-07T20:32:35.5511890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5511970Z 2025-05-07T20:32:35.5512064Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5512191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5512286Z x = x_sign * x_clamp 2025-05-07T20:32:35.5512369Z x0 = x[:, :D] 2025-05-07T20:32:35.5512454Z x1 = x[:, D:] 2025-05-07T20:32:35.5512535Z 2025-05-07T20:32:35.5512620Z if contiguous: 2025-05-07T20:32:35.5512714Z x0 = x0.contiguous() 2025-05-07T20:32:35.5512811Z x1 = x1.contiguous() 2025-05-07T20:32:35.5512887Z 2025-05-07T20:32:35.5512983Z if scale_ub is not None: 2025-05-07T20:32:35.5513090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5513229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5513310Z ) 2025-05-07T20:32:35.5513387Z else: 2025-05-07T20:32:35.5513482Z scale_ub_tensor = None 2025-05-07T20:32:35.5513558Z 2025-05-07T20:32:35.5513687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5513779Z op = silu_mul_quant 2025-05-07T20:32:35.5513937Z if compiled: 2025-05-07T20:32:35.5514039Z op = torch.compile(op) 2025-05-07T20:32:35.5514149Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5514232Z 2025-05-07T20:32:35.5514325Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5514329Z 2025-05-07T20:32:35.5514429Z moe/activation_test.py:117: 2025-05-07T20:32:35.5514561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5514666Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5514773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5515149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5515245Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5515757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5515860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5516228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5516499Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5516851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5517013Z kernel = self.compile( 2025-05-07T20:32:35.5517409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5517594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5517725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5517768Z 2025-05-07T20:32:35.5517977Z self = 2025-05-07T20:32:35.5518789Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5519311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac82a19e0>} 2025-05-07T20:32:35.5520091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5520324Z context = 2025-05-07T20:32:35.5520331Z 2025-05-07T20:32:35.5520567Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5520841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5520953Z module_map=module_map) 2025-05-07T20:32:35.5521121Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5521224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5521305Z E ^ 2025-05-07T20:32:35.5521673Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5521681Z 2025-05-07T20:32:35.5522108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5522113Z 2025-05-07T20:32:35.5522221Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5522448Z self=, 2025-05-07T20:32:35.5522528Z T=1, 2025-05-07T20:32:35.5522611Z D=5120, 2025-05-07T20:32:35.5522700Z scale_ub=None, 2025-05-07T20:32:35.5522787Z contiguous=False, 2025-05-07T20:32:35.5522885Z compiled=True, 2025-05-07T20:32:35.5522975Z ) 2025-05-07T20:32:35.5523377Z self = 2025-05-07T20:32:35.5523553Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5523558Z 2025-05-07T20:32:35.5523636Z @given( 2025-05-07T20:32:35.5523758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5523866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5523983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5524107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5524222Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5524298Z ) 2025-05-07T20:32:35.5524556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5524655Z def test_silu_mul_quant( 2025-05-07T20:32:35.5524733Z self, 2025-05-07T20:32:35.5524816Z T: int, 2025-05-07T20:32:35.5524894Z D: int, 2025-05-07T20:32:35.5524995Z scale_ub: Optional[float], 2025-05-07T20:32:35.5525089Z contiguous: bool, 2025-05-07T20:32:35.5525176Z compiled: bool, 2025-05-07T20:32:35.5525304Z ) -> None: 2025-05-07T20:32:35.5525401Z torch.manual_seed(2025) 2025-05-07T20:32:35.5525472Z 2025-05-07T20:32:35.5525646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5525761Z 2025-05-07T20:32:35.5525856Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5525988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5526078Z x = x_sign * x_clamp 2025-05-07T20:32:35.5526160Z x0 = x[:, :D] 2025-05-07T20:32:35.5526249Z x1 = x[:, D:] 2025-05-07T20:32:35.5526365Z 2025-05-07T20:32:35.5526450Z if contiguous: 2025-05-07T20:32:35.5526550Z x0 = x0.contiguous() 2025-05-07T20:32:35.5526645Z x1 = x1.contiguous() 2025-05-07T20:32:35.5526718Z 2025-05-07T20:32:35.5526818Z if scale_ub is not None: 2025-05-07T20:32:35.5526923Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5527062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5527143Z ) 2025-05-07T20:32:35.5527223Z else: 2025-05-07T20:32:35.5527323Z scale_ub_tensor = None 2025-05-07T20:32:35.5527396Z 2025-05-07T20:32:35.5527524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5527623Z op = silu_mul_quant 2025-05-07T20:32:35.5527710Z if compiled: 2025-05-07T20:32:35.5527810Z op = torch.compile(op) 2025-05-07T20:32:35.5527922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5527995Z 2025-05-07T20:32:35.5528090Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5528217Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5528288Z 2025-05-07T20:32:35.5528432Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5528537Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5528639Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5528769Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5528912Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5528988Z 2025-05-07T20:32:35.5529097Z > y_fp8_ref, 
y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8ac82a0680>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
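Note on the failure mode: every example dies with the same ValueError("type fp8e4nv not supported in this architecture. ..."), raised while Triton lowers the kernel to TTIR. fp8e4nv is Triton's e4m3 float8 type, and the error text says this CUDA backend only accepts 'fp8e4b15' and 'fp8e5' here; this job runs on a g5 instance (A10G, sm_86), which is consistent with the cast being rejected on a pre-sm_89 part. A minimal, self-contained sketch that should reproduce the same CompilationError on such a GPU follows; the kernel is hypothetical (not from the test suite) and assumes Triton 3.x dtype names plus PyTorch's torch.float8_e4m3fn:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Hypothetical repro kernel: the .to(tl.float8e4nv) cast is what trips
    # Triton's architecture check on pre-sm_89 GPUs.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

N = 1024
x = torch.randn(N, device="cuda", dtype=torch.float32)
y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
# Expected on sm_86: triton.compiler.errors.CompilationError wrapping
# ValueError("type fp8e4nv not supported in this architecture. ...")
_cast_fp8e4nv_kernel[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)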
Hypothesis then drew further examples; each failed at the same point with the identical CompilationError from _fbgemm_silu_mul_quant. The test source and traceback are unchanged from the block above, so only the parameter draws are listed here:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

For the compiled=True draws the traceback additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678 (in _fn, return fn(*args, **kwargs)) before reaching activation.py:80; torch.compile still dispatches to the same Triton kernel, so the failure is unchanged.
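Since every draw dies on the same architecture check, the whole property could be guarded up front instead of letting Hypothesis grind through failing examples. A sketch of such a guard, assuming unittest-style test classes as in this file; the helper name supports_fp8e4nv, the class placement, and the (8, 9) capability cutoff are assumptions, not FBGEMM API:

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) cast needs compute capability
    # >= (8, 9) (Ada/Hopper); the A10G on g5 runners reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv support")
class ActivationTests(unittest.TestCase):  # hypothetical placement
    # test_silu_mul_quant and related fp8 tests would live here unchanged.
    ...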
The next draw got further: with T=1, D=7168, scale_ub=None, contiguous=False, compiled=True, the failure was reported from the reference path rather than from fn(), i.e. the same ValueError now came out of _kernel_quantize_fp8_row via triton_quantize_fp8_row:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(autotuner and compiler frames identical to the first block above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
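The reference path quantizes the silu-mul product row-wise to fp8 with triton_quantize_fp8_row, which also JIT-compiles a Triton kernel and therefore hits the same architecture check. For debugging numerics on a GPU where that kernel cannot compile, a rough pure-PyTorch stand-in could be used; the following is a sketch of assumed rowwise absmax-scaling semantics (chosen to match the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None]), not FBGEMM's actual implementation:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: per-row absmax scaling into the e4m3 range.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
    y_scale = row_max / fp8_max                # multiplier applied at dequant
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale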
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5651721Z 2025-05-07T20:32:35.5652232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5652240Z 2025-05-07T20:32:35.5652344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5652576Z self=, 2025-05-07T20:32:35.5652653Z T=1, 2025-05-07T20:32:35.5652731Z D=5120, 2025-05-07T20:32:35.5652816Z scale_ub=1200.0, 2025-05-07T20:32:35.5652902Z contiguous=False, 2025-05-07T20:32:35.5652985Z compiled=True, 2025-05-07T20:32:35.5653061Z ) 2025-05-07T20:32:35.5653284Z self = 2025-05-07T20:32:35.5653486Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5653491Z 2025-05-07T20:32:35.5653584Z @given( 2025-05-07T20:32:35.5653713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5653817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5653931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5654094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5654213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5654285Z ) 2025-05-07T20:32:35.5654537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5654672Z def test_silu_mul_quant( 2025-05-07T20:32:35.5654748Z self, 2025-05-07T20:32:35.5654826Z T: int, 2025-05-07T20:32:35.5654906Z D: int, 2025-05-07T20:32:35.5655003Z scale_ub: Optional[float], 2025-05-07T20:32:35.5655090Z contiguous: bool, 2025-05-07T20:32:35.5655295Z compiled: bool, 2025-05-07T20:32:35.5655373Z ) -> None: 2025-05-07T20:32:35.5655469Z torch.manual_seed(2025) 2025-05-07T20:32:35.5655540Z 2025-05-07T20:32:35.5655714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5655792Z 2025-05-07T20:32:35.5655888Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5656011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5656104Z x = x_sign * x_clamp 2025-05-07T20:32:35.5656183Z x0 = x[:, :D] 2025-05-07T20:32:35.5656267Z x1 = x[:, D:] 2025-05-07T20:32:35.5656342Z 2025-05-07T20:32:35.5656428Z if contiguous: 2025-05-07T20:32:35.5656518Z x0 = x0.contiguous() 2025-05-07T20:32:35.5656622Z x1 = x1.contiguous() 2025-05-07T20:32:35.5656723Z 2025-05-07T20:32:35.5656850Z if scale_ub is not None: 2025-05-07T20:32:35.5656988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5657125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5657207Z ) 2025-05-07T20:32:35.5657283Z else: 2025-05-07T20:32:35.5657378Z scale_ub_tensor = None 2025-05-07T20:32:35.5657455Z 2025-05-07T20:32:35.5657588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5657677Z op = silu_mul_quant 2025-05-07T20:32:35.5657764Z if compiled: 2025-05-07T20:32:35.5657865Z op = torch.compile(op) 2025-05-07T20:32:35.5657968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5658041Z 2025-05-07T20:32:35.5658136Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5658143Z 2025-05-07T20:32:35.5658243Z moe/activation_test.py:117: 2025-05-07T20:32:35.5658375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5658475Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5658576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5658958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5659051Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5659622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5659721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5660097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5660324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5660679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5660795Z kernel = self.compile( 2025-05-07T20:32:35.5661327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5661511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5661645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5661650Z 2025-05-07T20:32:35.5661860Z self = 2025-05-07T20:32:35.5662733Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5663257Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8131e40>} 2025-05-07T20:32:35.5664081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5664316Z context = 2025-05-07T20:32:35.5664320Z 2025-05-07T20:32:35.5664490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5664770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5664877Z module_map=module_map) 2025-05-07T20:32:35.5665040Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5665145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5665220Z E ^ 2025-05-07T20:32:35.5665599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5665603Z 2025-05-07T20:32:35.5666038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5666042Z 2025-05-07T20:32:35.5666147Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5666382Z self=, 2025-05-07T20:32:35.5666458Z T=1, 2025-05-07T20:32:35.5666536Z D=5120, 2025-05-07T20:32:35.5666623Z scale_ub=1200.0, 2025-05-07T20:32:35.5666708Z contiguous=False, 2025-05-07T20:32:35.5666798Z compiled=False, 2025-05-07T20:32:35.5666869Z ) 2025-05-07T20:32:35.5667100Z self = 2025-05-07T20:32:35.5667282Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.5667286Z 2025-05-07T20:32:35.5667366Z @given( 2025-05-07T20:32:35.5667484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5667584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5667697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5667813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5667933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5668006Z ) 2025-05-07T20:32:35.5668263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5668401Z def test_silu_mul_quant( 2025-05-07T20:32:35.5668477Z self, 2025-05-07T20:32:35.5668562Z T: int, 2025-05-07T20:32:35.5668637Z D: int, 2025-05-07T20:32:35.5668737Z scale_ub: Optional[float], 2025-05-07T20:32:35.5668828Z contiguous: bool, 2025-05-07T20:32:35.5668912Z compiled: bool, 2025-05-07T20:32:35.5668987Z ) -> None: 2025-05-07T20:32:35.5669085Z torch.manual_seed(2025) 2025-05-07T20:32:35.5669160Z 2025-05-07T20:32:35.5669330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5669406Z 2025-05-07T20:32:35.5669497Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5669625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5669714Z x = x_sign * x_clamp 2025-05-07T20:32:35.5669796Z x0 = x[:, :D] 2025-05-07T20:32:35.5669877Z x1 = x[:, D:] 2025-05-07T20:32:35.5669948Z 2025-05-07T20:32:35.5670030Z if contiguous: 2025-05-07T20:32:35.5670127Z x0 = x0.contiguous() 2025-05-07T20:32:35.5670215Z x1 = x1.contiguous() 2025-05-07T20:32:35.5670284Z 2025-05-07T20:32:35.5670376Z if scale_ub is not None: 2025-05-07T20:32:35.5670524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5670663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5670745Z ) 2025-05-07T20:32:35.5670859Z else: 2025-05-07T20:32:35.5670952Z scale_ub_tensor = None 2025-05-07T20:32:35.5671027Z 2025-05-07T20:32:35.5671154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5671246Z op = silu_mul_quant 2025-05-07T20:32:35.5671329Z if compiled: 2025-05-07T20:32:35.5671427Z op = torch.compile(op) 2025-05-07T20:32:35.5672300Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5672373Z 2025-05-07T20:32:35.5672462Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5672466Z 2025-05-07T20:32:35.5672578Z moe/activation_test.py:117: 2025-05-07T20:32:35.5672728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5672854Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5672956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5673478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5673587Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5673960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5674190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5674556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5674652Z kernel = self.compile( 2025-05-07T20:32:35.5675055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5675234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5675372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5675376Z 2025-05-07T20:32:35.5675591Z self = 2025-05-07T20:32:35.5676405Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5676933Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8132ac0>} 2025-05-07T20:32:35.5677759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5677958Z context = 2025-05-07T20:32:35.5677963Z 2025-05-07T20:32:35.5678138Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5678411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5678524Z module_map=module_map) 2025-05-07T20:32:35.5678688Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5678788Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5678867Z E ^ 2025-05-07T20:32:35.5679238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5679246Z 2025-05-07T20:32:35.5679681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5679688Z 2025-05-07T20:32:35.5679790Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5680087Z self=, 2025-05-07T20:32:35.5680176Z T=16384, 2025-05-07T20:32:35.5680254Z D=5120, 2025-05-07T20:32:35.5680340Z scale_ub=1200.0, 2025-05-07T20:32:35.5680429Z contiguous=False, 2025-05-07T20:32:35.5680551Z compiled=True, 2025-05-07T20:32:35.5680625Z ) 2025-05-07T20:32:35.5680855Z self = 2025-05-07T20:32:35.5681043Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5681047Z 2025-05-07T20:32:35.5681166Z @given( 2025-05-07T20:32:35.5681287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5681387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5681508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5681630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5681745Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5681822Z ) 2025-05-07T20:32:35.5682077Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5682170Z def test_silu_mul_quant( 2025-05-07T20:32:35.5682251Z self, 2025-05-07T20:32:35.5682332Z T: int, 2025-05-07T20:32:35.5682407Z D: int, 2025-05-07T20:32:35.5682510Z scale_ub: Optional[float], 2025-05-07T20:32:35.5682599Z contiguous: bool, 2025-05-07T20:32:35.5682711Z compiled: bool, 2025-05-07T20:32:35.5682794Z ) -> None: 2025-05-07T20:32:35.5682913Z torch.manual_seed(2025) 2025-05-07T20:32:35.5682995Z 2025-05-07T20:32:35.5683167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5683242Z 2025-05-07T20:32:35.5683339Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5683467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5683557Z x = x_sign * x_clamp 2025-05-07T20:32:35.5683640Z x0 = x[:, :D] 2025-05-07T20:32:35.5683721Z x1 = x[:, D:] 2025-05-07T20:32:35.5683792Z 2025-05-07T20:32:35.5683879Z if contiguous: 2025-05-07T20:32:35.5683971Z x0 = x0.contiguous() 2025-05-07T20:32:35.5684060Z x1 = x1.contiguous() 2025-05-07T20:32:35.5684133Z 2025-05-07T20:32:35.5684223Z if scale_ub is not None: 2025-05-07T20:32:35.5684333Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5684470Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5684544Z ) 2025-05-07T20:32:35.5684623Z else: 2025-05-07T20:32:35.5684720Z scale_ub_tensor = None 2025-05-07T20:32:35.5684790Z 2025-05-07T20:32:35.5684920Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5685010Z op = silu_mul_quant 2025-05-07T20:32:35.5685140Z if compiled: 2025-05-07T20:32:35.5685247Z op = torch.compile(op) 2025-05-07T20:32:35.5685354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5685426Z 2025-05-07T20:32:35.5685515Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5685520Z 2025-05-07T20:32:35.5685617Z moe/activation_test.py:117: 2025-05-07T20:32:35.5685753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5685855Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5685955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5686334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5686427Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5686943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5687041Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5687410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5687685Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5688035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5688168Z kernel = self.compile( 2025-05-07T20:32:35.5688570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5688745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5688876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5688920Z 2025-05-07T20:32:35.5689128Z self = 2025-05-07T20:32:35.5689937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5690462Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917e4c180>} 2025-05-07T20:32:35.5691239Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5691433Z context = 2025-05-07T20:32:35.5691440Z 2025-05-07T20:32:35.5691605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5691940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5692055Z module_map=module_map) 2025-05-07T20:32:35.5692215Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5692321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5692396Z E ^ 2025-05-07T20:32:35.5692762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5693203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
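[Editor's note: every example fails at the same point, before the kernel body ever runs: Triton rejects the fp8e4nv (FP8 E4M3) element type while lowering _fbgemm_silu_mul_quant. fp8e4nv is only lowered on NVIDIA GPUs with compute capability >= 8.9 (Ada/Hopper); the A10G on this g5 runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal capability guard of the kind a test could use is sketched below; the helper name is hypothetical and not part of the FBGEMM test suite.

    import pytest
    import torch

    def require_fp8_e4m3() -> None:
        # A10G reports (8, 6); Triton's fp8e4nv needs (8, 9) or newer.
        if not torch.cuda.is_available():
            pytest.skip("CUDA device required")
        if torch.cuda.get_device_capability() < (8, 9):
            pytest.skip("Triton fp8e4nv (FP8 E4M3) requires SM >= 8.9")
]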
[The identical test body and CompilationError traceback were then printed verbatim for each further example Hypothesis tried:

    Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
    Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each attempt ended with:

    E   triton.compiler.errors.CompilationError: at 1:0:
    E   def _fbgemm_silu_mul_quant(
    E   ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

    /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
]
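[Editor's note: the failure is independent of Hypothesis and of torch.compile — it reproduces with compiled=False, where the torch/_dynamo/eval_frame.py frame is simply absent from the traceback. A standalone repro can be assembled from the test body above; silu_mul_quant and its module path are taken from the traceback, the input construction mirrors the test, and the snippet is a sketch rather than an official repro script.

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    # Raises triton.compiler.errors.CompilationError on GPUs where Triton
    # cannot lower fp8e4nv (e.g. SM 8.6):
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)
]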
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5829497Z 2025-05-07T20:32:35.5829935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5829939Z 2025-05-07T20:32:35.5830083Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5830318Z self=, 2025-05-07T20:32:35.5830394Z T=1, 2025-05-07T20:32:35.5830473Z D=7168, 2025-05-07T20:32:35.5830557Z scale_ub=None, 2025-05-07T20:32:35.5830644Z contiguous=False, 2025-05-07T20:32:35.5830727Z compiled=False, 2025-05-07T20:32:35.5830804Z ) 2025-05-07T20:32:35.5831030Z self = 2025-05-07T20:32:35.5831200Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.5831207Z 2025-05-07T20:32:35.5831283Z @given( 2025-05-07T20:32:35.5831401Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5831504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5831618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5831734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5831853Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5831925Z ) 2025-05-07T20:32:35.5832179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5832316Z def test_silu_mul_quant( 2025-05-07T20:32:35.5832392Z self, 2025-05-07T20:32:35.5832467Z T: int, 2025-05-07T20:32:35.5832547Z D: int, 2025-05-07T20:32:35.5832682Z scale_ub: Optional[float], 2025-05-07T20:32:35.5832773Z contiguous: bool, 2025-05-07T20:32:35.5832857Z compiled: bool, 2025-05-07T20:32:35.5832934Z ) -> None: 2025-05-07T20:32:35.5833031Z torch.manual_seed(2025) 2025-05-07T20:32:35.5833103Z 2025-05-07T20:32:35.5833280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5833414Z 2025-05-07T20:32:35.5833527Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5833654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5833750Z x = x_sign * x_clamp 2025-05-07T20:32:35.5833828Z x0 = x[:, :D] 2025-05-07T20:32:35.5833904Z x1 = x[:, D:] 2025-05-07T20:32:35.5833980Z 2025-05-07T20:32:35.5834063Z if contiguous: 2025-05-07T20:32:35.5834155Z x0 = x0.contiguous() 2025-05-07T20:32:35.5834241Z x1 = x1.contiguous() 2025-05-07T20:32:35.5834310Z 2025-05-07T20:32:35.5834400Z if scale_ub is not None: 2025-05-07T20:32:35.5834509Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5834645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5834722Z ) 2025-05-07T20:32:35.5834797Z else: 2025-05-07T20:32:35.5834890Z scale_ub_tensor = None 2025-05-07T20:32:35.5834964Z 2025-05-07T20:32:35.5835095Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5835183Z op = silu_mul_quant 2025-05-07T20:32:35.5835268Z if compiled: 2025-05-07T20:32:35.5835368Z op = torch.compile(op) 2025-05-07T20:32:35.5835474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5835545Z 2025-05-07T20:32:35.5835635Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5835642Z 2025-05-07T20:32:35.5835742Z moe/activation_test.py:117: 2025-05-07T20:32:35.5835871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5835971Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5836076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5836596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5836692Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5837071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5837300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5837730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5837825Z kernel = self.compile( 2025-05-07T20:32:35.5838227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5838407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5838537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5838542Z 2025-05-07T20:32:35.5838751Z self = 2025-05-07T20:32:35.5839573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5840103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917a57100>} 2025-05-07T20:32:35.5840933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5841128Z context = 2025-05-07T20:32:35.5841168Z 2025-05-07T20:32:35.5841342Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5841614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5841721Z module_map=module_map) 2025-05-07T20:32:35.5841924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5842023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5842103Z E ^ 2025-05-07T20:32:35.5842475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5842480Z 2025-05-07T20:32:35.5842917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5842922Z 2025-05-07T20:32:35.5843029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5843263Z self=, 2025-05-07T20:32:35.5843344Z T=2048, 2025-05-07T20:32:35.5843439Z D=7168, 2025-05-07T20:32:35.5843526Z scale_ub=None, 2025-05-07T20:32:35.5843639Z contiguous=False, 2025-05-07T20:32:35.5843724Z compiled=True, 2025-05-07T20:32:35.5843796Z ) 2025-05-07T20:32:35.5844029Z self = 2025-05-07T20:32:35.5844207Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5844211Z 2025-05-07T20:32:35.5844287Z @given( 2025-05-07T20:32:35.5844411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5844509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5844625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5844744Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5844856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5844933Z ) 2025-05-07T20:32:35.5845187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5845280Z def test_silu_mul_quant( 2025-05-07T20:32:35.5845359Z self, 2025-05-07T20:32:35.5845434Z T: int, 2025-05-07T20:32:35.5845509Z D: int, 2025-05-07T20:32:35.5845610Z scale_ub: Optional[float], 2025-05-07T20:32:35.5845702Z contiguous: bool, 2025-05-07T20:32:35.5845787Z compiled: bool, 2025-05-07T20:32:35.5845866Z ) -> None: 2025-05-07T20:32:35.5845959Z torch.manual_seed(2025) 2025-05-07T20:32:35.5846076Z 2025-05-07T20:32:35.5846250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5846324Z 2025-05-07T20:32:35.5846422Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5846547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5846634Z x = x_sign * x_clamp 2025-05-07T20:32:35.5846715Z x0 = x[:, :D] 2025-05-07T20:32:35.5846795Z x1 = x[:, D:] 2025-05-07T20:32:35.5846866Z 2025-05-07T20:32:35.5846951Z if contiguous: 2025-05-07T20:32:35.5847040Z x0 = x0.contiguous() 2025-05-07T20:32:35.5847127Z x1 = x1.contiguous() 2025-05-07T20:32:35.5847202Z 2025-05-07T20:32:35.5847291Z if scale_ub is not None: 2025-05-07T20:32:35.5847397Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5847534Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5847608Z ) 2025-05-07T20:32:35.5847687Z else: 2025-05-07T20:32:35.5847782Z scale_ub_tensor = None 2025-05-07T20:32:35.5847852Z 2025-05-07T20:32:35.5847982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5848115Z op = silu_mul_quant 2025-05-07T20:32:35.5848200Z if compiled: 2025-05-07T20:32:35.5848301Z op = torch.compile(op) 2025-05-07T20:32:35.5848404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5848513Z 2025-05-07T20:32:35.5848606Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5848610Z 2025-05-07T20:32:35.5848705Z moe/activation_test.py:117: 2025-05-07T20:32:35.5848839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5848938Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5849075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5849459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5849553Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5850069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5850169Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5850545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5850781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5851133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5851226Z kernel = self.compile( 2025-05-07T20:32:35.5851630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5851856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5851991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5851996Z 2025-05-07T20:32:35.5852208Z self = 2025-05-07T20:32:35.5853029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5853563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b44720>} 2025-05-07T20:32:35.5854352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5854554Z context = 2025-05-07T20:32:35.5854558Z 2025-05-07T20:32:35.5854776Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5858599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5858783Z module_map=module_map) 2025-05-07T20:32:35.5859005Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5859113Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5859196Z E ^ 2025-05-07T20:32:35.5859632Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5859638Z 2025-05-07T20:32:35.5860075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5860084Z 2025-05-07T20:32:35.5860186Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5860415Z self=, 2025-05-07T20:32:35.5860498Z T=4096, 2025-05-07T20:32:35.5860575Z D=7168, 2025-05-07T20:32:35.5860656Z scale_ub=None, 2025-05-07T20:32:35.5860746Z contiguous=False, 2025-05-07T20:32:35.5860908Z compiled=True, 2025-05-07T20:32:35.5860986Z ) 2025-05-07T20:32:35.5861216Z self = 2025-05-07T20:32:35.5861391Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5861437Z 2025-05-07T20:32:35.5861515Z @given( 2025-05-07T20:32:35.5861636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5861734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5861850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5862006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5862122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5862196Z ) 2025-05-07T20:32:35.5862456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5862548Z def test_silu_mul_quant( 2025-05-07T20:32:35.5862629Z self, 2025-05-07T20:32:35.5862707Z T: int, 2025-05-07T20:32:35.5862784Z D: int, 2025-05-07T20:32:35.5862881Z scale_ub: Optional[float], 2025-05-07T20:32:35.5862968Z contiguous: bool, 2025-05-07T20:32:35.5863054Z compiled: bool, 2025-05-07T20:32:35.5863134Z ) -> None: 2025-05-07T20:32:35.5863226Z torch.manual_seed(2025) 2025-05-07T20:32:35.5863299Z 2025-05-07T20:32:35.5863472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5863546Z 2025-05-07T20:32:35.5863640Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5863767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5863854Z x = x_sign * x_clamp 2025-05-07T20:32:35.5863935Z x0 = x[:, :D] 2025-05-07T20:32:35.5864016Z x1 = x[:, D:] 2025-05-07T20:32:35.5864092Z 2025-05-07T20:32:35.5864174Z if contiguous: 2025-05-07T20:32:35.5864261Z x0 = x0.contiguous() 2025-05-07T20:32:35.5864353Z x1 = x1.contiguous() 2025-05-07T20:32:35.5864428Z 2025-05-07T20:32:35.5864518Z if scale_ub is not None: 2025-05-07T20:32:35.5864626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5864762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5864838Z ) 2025-05-07T20:32:35.5864916Z else: 2025-05-07T20:32:35.5865007Z scale_ub_tensor = None 2025-05-07T20:32:35.5865078Z 2025-05-07T20:32:35.5865209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5865298Z op = silu_mul_quant 2025-05-07T20:32:35.5865384Z if compiled: 2025-05-07T20:32:35.5865489Z op = torch.compile(op) 2025-05-07T20:32:35.5865593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5865668Z 2025-05-07T20:32:35.5865806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5865811Z 2025-05-07T20:32:35.5865910Z moe/activation_test.py:117: 2025-05-07T20:32:35.5866052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5866152Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5866250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5866635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5866732Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5867244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5867344Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5867711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5867947Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5868298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5868433Z kernel = self.compile( 2025-05-07T20:32:35.5868901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5869098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5869278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5869283Z 2025-05-07T20:32:35.5869512Z self = 2025-05-07T20:32:35.5870488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5871079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b45440>} 2025-05-07T20:32:35.5871854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5872059Z context = 2025-05-07T20:32:35.5872064Z 2025-05-07T20:32:35.5872232Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5872502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5872610Z module_map=module_map) 2025-05-07T20:32:35.5872772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5872875Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5872954Z E ^ 2025-05-07T20:32:35.5873318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5873323Z 2025-05-07T20:32:35.5873756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5873760Z 2025-05-07T20:32:35.5873862Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5874117Z self=, 2025-05-07T20:32:35.5874224Z T=16384, 2025-05-07T20:32:35.5874334Z D=5120, 2025-05-07T20:32:35.5874431Z scale_ub=1200.0, 2025-05-07T20:32:35.5874519Z contiguous=False, 2025-05-07T20:32:35.5874602Z compiled=False, 2025-05-07T20:32:35.5874680Z ) 2025-05-07T20:32:35.5874906Z self = 2025-05-07T20:32:35.5875092Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.5875156Z 2025-05-07T20:32:35.5875234Z @given( 2025-05-07T20:32:35.5875355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5875458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5875574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5875690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5875812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5875885Z ) 2025-05-07T20:32:35.5876138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5876235Z def test_silu_mul_quant( 2025-05-07T20:32:35.5876312Z self, 2025-05-07T20:32:35.5876391Z T: int, 2025-05-07T20:32:35.5876471Z D: int, 2025-05-07T20:32:35.5876566Z scale_ub: Optional[float], 2025-05-07T20:32:35.5876656Z contiguous: bool, 2025-05-07T20:32:35.5876741Z compiled: bool, 2025-05-07T20:32:35.5876818Z ) -> None: 2025-05-07T20:32:35.5876917Z torch.manual_seed(2025) 2025-05-07T20:32:35.5876986Z 2025-05-07T20:32:35.5877155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5877278Z 2025-05-07T20:32:35.5877372Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5877496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5877585Z x = x_sign * x_clamp 2025-05-07T20:32:35.5877703Z x0 = x[:, :D] 2025-05-07T20:32:35.5877782Z x1 = x[:, D:] 2025-05-07T20:32:35.5877856Z 2025-05-07T20:32:35.5877940Z if contiguous: 2025-05-07T20:32:35.5878034Z x0 = x0.contiguous() 2025-05-07T20:32:35.5878122Z x1 = x1.contiguous() 2025-05-07T20:32:35.5878234Z 2025-05-07T20:32:35.5878325Z if scale_ub is not None: 2025-05-07T20:32:35.5878429Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5878564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5878647Z ) 2025-05-07T20:32:35.5878723Z else: 2025-05-07T20:32:35.5878816Z scale_ub_tensor = None 2025-05-07T20:32:35.5878892Z 2025-05-07T20:32:35.5879022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5879112Z op = silu_mul_quant 2025-05-07T20:32:35.5879197Z if compiled: 2025-05-07T20:32:35.5879296Z op = torch.compile(op) 2025-05-07T20:32:35.5879401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5879475Z 2025-05-07T20:32:35.5879563Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5879568Z 2025-05-07T20:32:35.5879665Z moe/activation_test.py:117: 2025-05-07T20:32:35.5879796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5879897Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5879998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5880515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.5880611Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5880982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5881209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5881562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5881654Z kernel = self.compile( 2025-05-07T20:32:35.5882046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5882229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5882361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5882365Z 2025-05-07T20:32:35.5882621Z self = 2025-05-07T20:32:35.5883427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5883946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b46340>} 2025-05-07T20:32:35.5884723Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5884922Z context = 2025-05-07T20:32:35.5884926Z 2025-05-07T20:32:35.5885098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5885368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5885473Z module_map=module_map) 2025-05-07T20:32:35.5885680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5885780Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5885859Z E ^ 2025-05-07T20:32:35.5886222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5886264Z 2025-05-07T20:32:35.5886691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5886696Z 2025-05-07T20:32:35.5886802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5887076Z self=, 2025-05-07T20:32:35.5887153Z T=16384, 2025-05-07T20:32:35.5887229Z D=5120, 2025-05-07T20:32:35.5887312Z scale_ub=1200.0, 2025-05-07T20:32:35.5887402Z contiguous=True, 2025-05-07T20:32:35.5887484Z compiled=True, 2025-05-07T20:32:35.5887556Z ) 2025-05-07T20:32:35.5887784Z self = 2025-05-07T20:32:35.5887965Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5887969Z 2025-05-07T20:32:35.5888044Z @given( 2025-05-07T20:32:35.5888171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5888271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5888384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5888501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5888614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5888692Z ) 2025-05-07T20:32:35.5888940Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5889033Z def test_silu_mul_quant( 2025-05-07T20:32:35.5889112Z self, 2025-05-07T20:32:35.5889189Z T: int, 2025-05-07T20:32:35.5889265Z D: int, 2025-05-07T20:32:35.5889363Z scale_ub: Optional[float], 2025-05-07T20:32:35.5889454Z contiguous: bool, 2025-05-07T20:32:35.5889540Z compiled: bool, 2025-05-07T20:32:35.5889618Z ) -> None: 2025-05-07T20:32:35.5889710Z torch.manual_seed(2025) 2025-05-07T20:32:35.5889783Z 2025-05-07T20:32:35.5889955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5890029Z 2025-05-07T20:32:35.5890124Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5890247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5890333Z x = x_sign * x_clamp 2025-05-07T20:32:35.5890417Z x0 = x[:, :D] 2025-05-07T20:32:35.5890495Z x1 = x[:, D:] 2025-05-07T20:32:35.5890566Z 2025-05-07T20:32:35.5890653Z if contiguous: 2025-05-07T20:32:35.5890742Z x0 = x0.contiguous() 2025-05-07T20:32:35.5890879Z x1 = x1.contiguous() 2025-05-07T20:32:35.5890956Z 2025-05-07T20:32:35.5891045Z if scale_ub is not None: 2025-05-07T20:32:35.5891151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5891293Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5891368Z ) 2025-05-07T20:32:35.5891449Z else: 2025-05-07T20:32:35.5891543Z scale_ub_tensor = None 2025-05-07T20:32:35.5891613Z 2025-05-07T20:32:35.5891743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5891925Z op = silu_mul_quant 2025-05-07T20:32:35.5892010Z if compiled: 2025-05-07T20:32:35.5892111Z op = torch.compile(op) 2025-05-07T20:32:35.5892217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5892287Z 2025-05-07T20:32:35.5892381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5892386Z 2025-05-07T20:32:35.5892480Z moe/activation_test.py:117: 2025-05-07T20:32:35.5892611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5892713Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5892858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5893238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5893372Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5893878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5893978Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5894344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5894610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5894961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5895055Z kernel = self.compile( 2025-05-07T20:32:35.5895456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5895634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5895763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5895769Z 2025-05-07T20:32:35.5895980Z self = 2025-05-07T20:32:35.5896782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5897311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b479c0>} 2025-05-07T20:32:35.5898084Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5898281Z context = 2025-05-07T20:32:35.5898287Z 2025-05-07T20:32:35.5898453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5898723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5898832Z module_map=module_map) 2025-05-07T20:32:35.5898999Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5899098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5899176Z E ^ 2025-05-07T20:32:35.5899586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5899591Z 2025-05-07T20:32:35.5900028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5900032Z 2025-05-07T20:32:35.5900135Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5900362Z self=, 2025-05-07T20:32:35.5900445Z T=16384, 2025-05-07T20:32:35.5900523Z D=5120, 2025-05-07T20:32:35.5900605Z scale_ub=None, 2025-05-07T20:32:35.5900695Z contiguous=False, 2025-05-07T20:32:35.5900778Z compiled=True, 2025-05-07T20:32:35.5900851Z ) 2025-05-07T20:32:35.5901075Z self = 2025-05-07T20:32:35.5901259Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5901263Z 2025-05-07T20:32:35.5901342Z @given( 2025-05-07T20:32:35.5901465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5901563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5901680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5901841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5901957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5902029Z ) 2025-05-07T20:32:35.5902280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5902455Z def test_silu_mul_quant( 2025-05-07T20:32:35.5902533Z self, 2025-05-07T20:32:35.5902607Z T: int, 2025-05-07T20:32:35.5902685Z D: int, 2025-05-07T20:32:35.5902782Z scale_ub: Optional[float], 2025-05-07T20:32:35.5902870Z contiguous: bool, 2025-05-07T20:32:35.5903001Z compiled: bool, 2025-05-07T20:32:35.5903078Z ) -> None: 2025-05-07T20:32:35.5903171Z torch.manual_seed(2025) 2025-05-07T20:32:35.5903246Z 2025-05-07T20:32:35.5903417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5903490Z 2025-05-07T20:32:35.5903585Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5903710Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5903802Z x = x_sign * x_clamp 2025-05-07T20:32:35.5903881Z x0 = x[:, :D] 2025-05-07T20:32:35.5903956Z x1 = x[:, D:] 2025-05-07T20:32:35.5904035Z 2025-05-07T20:32:35.5904119Z if contiguous: 2025-05-07T20:32:35.5904207Z x0 = x0.contiguous() 2025-05-07T20:32:35.5904297Z x1 = x1.contiguous() 2025-05-07T20:32:35.5904368Z 2025-05-07T20:32:35.5904456Z if scale_ub is not None: 2025-05-07T20:32:35.5904561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5904698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5904772Z ) 2025-05-07T20:32:35.5904852Z else: 2025-05-07T20:32:35.5904947Z scale_ub_tensor = None 2025-05-07T20:32:35.5905025Z 2025-05-07T20:32:35.5905154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5905245Z op = silu_mul_quant 2025-05-07T20:32:35.5905335Z if compiled: 2025-05-07T20:32:35.5905433Z op = torch.compile(op) 2025-05-07T20:32:35.5905537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5905611Z 2025-05-07T20:32:35.5905703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5905707Z 2025-05-07T20:32:35.5905802Z moe/activation_test.py:117: 2025-05-07T20:32:35.5905937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5906035Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5906314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5906841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5906968Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5907581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5907684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5908053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5908285Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5908636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5908734Z kernel = self.compile( 2025-05-07T20:32:35.5909129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5909311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5909445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5909452Z 2025-05-07T20:32:35.5909659Z self = 2025-05-07T20:32:35.5910537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5911109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7cc20>} 2025-05-07T20:32:35.5911881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5912138Z context = 2025-05-07T20:32:35.5912142Z 2025-05-07T20:32:35.5912314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5912589Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5912698Z module_map=module_map) 2025-05-07T20:32:35.5912860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5912961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5913039Z E ^ 2025-05-07T20:32:35.5913404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5913412Z 2025-05-07T20:32:35.5913838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5913845Z 2025-05-07T20:32:35.5913948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5914177Z self=, 2025-05-07T20:32:35.5914254Z T=2048, 2025-05-07T20:32:35.5914333Z D=5120, 2025-05-07T20:32:35.5914418Z scale_ub=None, 2025-05-07T20:32:35.5914504Z contiguous=False, 2025-05-07T20:32:35.5914586Z compiled=True, 2025-05-07T20:32:35.5914664Z ) 2025-05-07T20:32:35.5914888Z self = 2025-05-07T20:32:35.5915068Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5915075Z 2025-05-07T20:32:35.5915151Z @given( 2025-05-07T20:32:35.5915270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5915371Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5915485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5915602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5915720Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5915792Z ) 2025-05-07T20:32:35.5916046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5916187Z def test_silu_mul_quant( 2025-05-07T20:32:35.5916264Z self, 2025-05-07T20:32:35.5916342Z T: int, 2025-05-07T20:32:35.5916418Z D: int, 2025-05-07T20:32:35.5916516Z scale_ub: Optional[float], 2025-05-07T20:32:35.5916608Z contiguous: bool, 2025-05-07T20:32:35.5916693Z compiled: bool, 2025-05-07T20:32:35.5916770Z ) -> None: 2025-05-07T20:32:35.5916869Z torch.manual_seed(2025) 2025-05-07T20:32:35.5916940Z 2025-05-07T20:32:35.5917108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5917185Z 2025-05-07T20:32:35.5917277Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5917401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5917493Z x = x_sign * x_clamp 2025-05-07T20:32:35.5917572Z x0 = x[:, :D] 2025-05-07T20:32:35.5917653Z x1 = x[:, D:] 2025-05-07T20:32:35.5917725Z 2025-05-07T20:32:35.5917805Z if contiguous: 2025-05-07T20:32:35.5917901Z x0 = x0.contiguous() 2025-05-07T20:32:35.5917994Z x1 = x1.contiguous() 2025-05-07T20:32:35.5918066Z 2025-05-07T20:32:35.5918203Z if scale_ub is not None: 2025-05-07T20:32:35.5918308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5918443Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5918562Z ) 2025-05-07T20:32:35.5918641Z else: 2025-05-07T20:32:35.5918738Z scale_ub_tensor = None 2025-05-07T20:32:35.5918814Z 2025-05-07T20:32:35.5918942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5919031Z op = silu_mul_quant 2025-05-07T20:32:35.5919118Z if compiled: 2025-05-07T20:32:35.5919259Z op = torch.compile(op) 2025-05-07T20:32:35.5919365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5919435Z 2025-05-07T20:32:35.5919525Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5919532Z 2025-05-07T20:32:35.5919629Z moe/activation_test.py:117: 2025-05-07T20:32:35.5919760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5919864Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5919964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5920341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5920436Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5920949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5921045Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5921422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5921648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5921999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5922095Z kernel = self.compile( 2025-05-07T20:32:35.5922491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5922696Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5922848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5922853Z 2025-05-07T20:32:35.5923066Z self = 2025-05-07T20:32:35.5923870Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5924436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7d9e0>} 2025-05-07T20:32:35.5925220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5925412Z context = 2025-05-07T20:32:35.5925419Z 2025-05-07T20:32:35.5925586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5925857Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5925962Z module_map=module_map) 2025-05-07T20:32:35.5926130Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5926228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5926305Z E ^ 2025-05-07T20:32:35.5926674Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5926679Z 2025-05-07T20:32:35.5927231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5927236Z 2025-05-07T20:32:35.5927343Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5927611Z self=, 2025-05-07T20:32:35.5927689Z T=2048, 2025-05-07T20:32:35.5927766Z D=5120, 2025-05-07T20:32:35.5927848Z scale_ub=1200.0, 2025-05-07T20:32:35.5927931Z contiguous=False, 2025-05-07T20:32:35.5928018Z compiled=True, 2025-05-07T20:32:35.5928088Z ) 2025-05-07T20:32:35.5928353Z self = 2025-05-07T20:32:35.5928536Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5928541Z 2025-05-07T20:32:35.5928620Z @given( 2025-05-07T20:32:35.5928741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5928841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5928955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5929073Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5929185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5929259Z ) 2025-05-07T20:32:35.5929512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5929605Z def test_silu_mul_quant( 2025-05-07T20:32:35.5929681Z self, 2025-05-07T20:32:35.5929761Z T: int, 2025-05-07T20:32:35.5929836Z D: int, 2025-05-07T20:32:35.5929936Z scale_ub: Optional[float], 2025-05-07T20:32:35.5930028Z contiguous: bool, 2025-05-07T20:32:35.5930113Z compiled: bool, 2025-05-07T20:32:35.5930194Z ) -> None: 2025-05-07T20:32:35.5930286Z torch.manual_seed(2025) 2025-05-07T20:32:35.5930357Z 2025-05-07T20:32:35.5930527Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5930602Z 2025-05-07T20:32:35.5930696Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5930823Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5930910Z x = x_sign * x_clamp 2025-05-07T20:32:35.5930986Z x0 = x[:, :D] 2025-05-07T20:32:35.5931069Z x1 = x[:, D:] 2025-05-07T20:32:35.5931139Z 2025-05-07T20:32:35.5931220Z if contiguous: 2025-05-07T20:32:35.5931313Z x0 = x0.contiguous() 2025-05-07T20:32:35.5931400Z x1 = x1.contiguous() 2025-05-07T20:32:35.5931475Z 2025-05-07T20:32:35.5931565Z if scale_ub is not None: 2025-05-07T20:32:35.5931672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5931863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5931939Z ) 2025-05-07T20:32:35.5932013Z else: 2025-05-07T20:32:35.5932159Z scale_ub_tensor = None 2025-05-07T20:32:35.5932234Z 2025-05-07T20:32:35.5932363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5932458Z op = silu_mul_quant 2025-05-07T20:32:35.5932541Z if compiled: 2025-05-07T20:32:35.5932638Z op = torch.compile(op) 2025-05-07T20:32:35.5932749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5932822Z 2025-05-07T20:32:35.5932917Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5932921Z 2025-05-07T20:32:35.5933018Z moe/activation_test.py:117: 2025-05-07T20:32:35.5933149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5933253Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5933356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5933734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5933832Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5934411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5934510Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5934879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5935146Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5935504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5935595Z kernel = self.compile( 2025-05-07T20:32:35.5935990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5936209Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5936342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5936346Z 2025-05-07T20:32:35.5936560Z self = 2025-05-07T20:32:35.5937365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5937887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7eb60>} 2025-05-07T20:32:35.5938665Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5938860Z context = 2025-05-07T20:32:35.5938867Z 2025-05-07T20:32:35.5939038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5939310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5939421Z module_map=module_map) 2025-05-07T20:32:35.5939583Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5939682Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5939759Z E ^ 2025-05-07T20:32:35.5940123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5940127Z 2025-05-07T20:32:35.5940554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5940561Z 2025-05-07T20:32:35.5940666Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5940937Z self=, 2025-05-07T20:32:35.5941018Z T=4096, 2025-05-07T20:32:35.5941092Z D=5120, 2025-05-07T20:32:35.5941175Z scale_ub=1200.0, 2025-05-07T20:32:35.5941264Z contiguous=True, 2025-05-07T20:32:35.5941345Z compiled=True, 2025-05-07T20:32:35.5941416Z ) 2025-05-07T20:32:35.5941643Z self = 2025-05-07T20:32:35.5941820Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5941825Z 2025-05-07T20:32:35.5941903Z @given( 2025-05-07T20:32:35.5942025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5942123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5942239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5942357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5942471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5942551Z ) 2025-05-07T20:32:35.5942850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5942941Z def test_silu_mul_quant( 2025-05-07T20:32:35.5943019Z self, 2025-05-07T20:32:35.5943137Z T: int, 2025-05-07T20:32:35.5943215Z D: int, 2025-05-07T20:32:35.5943316Z scale_ub: Optional[float], 2025-05-07T20:32:35.5943403Z contiguous: bool, 2025-05-07T20:32:35.5943525Z compiled: bool, 2025-05-07T20:32:35.5943604Z ) -> None: 2025-05-07T20:32:35.5943699Z torch.manual_seed(2025) 2025-05-07T20:32:35.5943771Z 2025-05-07T20:32:35.5943939Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5944011Z 2025-05-07T20:32:35.5944106Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5944269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5944355Z x = x_sign * x_clamp 2025-05-07T20:32:35.5944436Z x0 = x[:, :D] 2025-05-07T20:32:35.5944515Z x1 = x[:, D:] 2025-05-07T20:32:35.5944590Z 2025-05-07T20:32:35.5944677Z if contiguous: 2025-05-07T20:32:35.5944765Z x0 = x0.contiguous() 2025-05-07T20:32:35.5944852Z x1 = x1.contiguous() 2025-05-07T20:32:35.5944926Z 2025-05-07T20:32:35.5945014Z if scale_ub is not None: 2025-05-07T20:32:35.5945122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5945257Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5945334Z ) 2025-05-07T20:32:35.5945418Z else: 2025-05-07T20:32:35.5945509Z scale_ub_tensor = None 2025-05-07T20:32:35.5945579Z 2025-05-07T20:32:35.5945709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5945798Z op = silu_mul_quant 2025-05-07T20:32:35.5945884Z if compiled: 2025-05-07T20:32:35.5945984Z op = torch.compile(op) 2025-05-07T20:32:35.5946087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5946158Z 2025-05-07T20:32:35.5946254Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5946258Z 2025-05-07T20:32:35.5946354Z moe/activation_test.py:117: 2025-05-07T20:32:35.5946489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5946588Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5946685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5947070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5947163Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5947670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5947774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5948140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5948417Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5948766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5948862Z kernel = self.compile( 2025-05-07T20:32:35.5949260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5949439Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5949568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5949576Z 2025-05-07T20:32:35.5949783Z self = 2025-05-07T20:32:35.5950587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5951113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917880180>} 2025-05-07T20:32:35.5951926Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5952161Z context = 2025-05-07T20:32:35.5952166Z 2025-05-07T20:32:35.5952331Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5952598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5952747Z module_map=module_map) 2025-05-07T20:32:35.5952909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5953009Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5953088Z E ^ 2025-05-07T20:32:35.5953452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5953458Z 2025-05-07T20:32:35.5953888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5953892Z 2025-05-07T20:32:35.5953999Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5954226Z self=, 2025-05-07T20:32:35.5954306Z T=128, 2025-05-07T20:32:35.5954381Z D=5120, 2025-05-07T20:32:35.5954468Z scale_ub=1200.0, 2025-05-07T20:32:35.5954553Z contiguous=False, 2025-05-07T20:32:35.5954642Z compiled=True, 2025-05-07T20:32:35.5954715Z ) 2025-05-07T20:32:35.5954938Z self = 2025-05-07T20:32:35.5955115Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5955120Z 2025-05-07T20:32:35.5955200Z @given( 2025-05-07T20:32:35.5955317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5955418Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5955534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5955652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5955775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5955847Z ) 2025-05-07T20:32:35.5956100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5956195Z def test_silu_mul_quant( 2025-05-07T20:32:35.5956270Z self, 2025-05-07T20:32:35.5956345Z T: int, 2025-05-07T20:32:35.5956425Z D: int, 2025-05-07T20:32:35.5956521Z scale_ub: Optional[float], 2025-05-07T20:32:35.5956609Z contiguous: bool, 2025-05-07T20:32:35.5956696Z compiled: bool, 2025-05-07T20:32:35.5956773Z ) -> None: 2025-05-07T20:32:35.5956911Z torch.manual_seed(2025) 2025-05-07T20:32:35.5956989Z 2025-05-07T20:32:35.5957159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5957233Z 2025-05-07T20:32:35.5957324Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5957448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5957538Z x = x_sign * x_clamp 2025-05-07T20:32:35.5957619Z x0 = x[:, :D] 2025-05-07T20:32:35.5957694Z x1 = x[:, D:] 2025-05-07T20:32:35.5957769Z 2025-05-07T20:32:35.5957852Z if contiguous: 2025-05-07T20:32:35.5957941Z x0 = x0.contiguous() 2025-05-07T20:32:35.5958031Z x1 = x1.contiguous() 2025-05-07T20:32:35.5958109Z 2025-05-07T20:32:35.5958197Z if scale_ub is not None: 2025-05-07T20:32:35.5958306Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5958441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5958520Z ) 2025-05-07T20:32:35.5958595Z else: 2025-05-07T20:32:35.5958688Z scale_ub_tensor = None 2025-05-07T20:32:35.5958763Z 2025-05-07T20:32:35.5958934Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5959025Z op = silu_mul_quant 2025-05-07T20:32:35.5959109Z if compiled: 2025-05-07T20:32:35.5959206Z op = torch.compile(op) 2025-05-07T20:32:35.5959347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5959421Z 2025-05-07T20:32:35.5959510Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5959514Z 2025-05-07T20:32:35.5959611Z moe/activation_test.py:117: 2025-05-07T20:32:35.5959743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5959880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5959984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5960364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5960455Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5960968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8917880ea0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
[... identical Triton traceback and CompilationError ("type fp8e4nv not supported in this architecture") as above ...]
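Every CompilationError in this run bottoms out in the same ValueError: Triton's fp8e4nv type (float8_e4m3fn) is only lowered natively on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). The 22.07 GiB device reported below only offers 'fp8e4b15' and 'fp8e5', consistent with a pre-Ada part such as the A10G (sm_86). A minimal guard that would skip rather than fail these cases on older parts, sketched here with a hypothetical helper name (nothing below is taken from the test file):

import torch

def device_supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9; an A10G reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(0) >= (8, 9)

# Applied to the test above, e.g.:
#   @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")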
Hypothesis keeps drawing new examples; each one repeats the test body and the Triton traceback above verbatim, so only the drawn parameters and the outcome are shown from here on:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> CompilationError (same error)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError (same error)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError (same error)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError (same error)

The next example fails in a new way: the input allocation itself no longer fits on the device.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
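The allocator hint at the end of that message can be applied without touching the test, though it only helps with fragmentation, and here just 45.02 MiB is reserved-but-unallocated against 21.60 GiB of live allocations. A sketch of applying it anyway (the conftest.py placement is an assumption; the variable must be set before the first CUDA allocation):

import os

# Read by PyTorch's caching allocator when CUDA is first initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")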
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB; 28.44 MiB free
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 140.44 MiB free
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB; 28.44 MiB free
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB; 28.44 MiB free
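The failed allocation sizes are not arbitrary: each is exactly one bfloat16 intermediate of shape [T, 2 * D] from the test body (torch.randn, torch.sign, and torch.clamp each materialize such a tensor, at 2 bytes per element). A quick check against the OutOfMemoryError lines above:

for T, D in [(16384, 7168), (16384, 5120), (4096, 7168), (2048, 7168)]:
    mib = T * (2 * D) * 2 / 2**20  # elements x 2 bytes per bfloat16
    print(f"T={T:>5}, D={D}: {mib:.2f} MiB")
# -> 448.00, 320.00, 112.00, and 56.00 MiB, matching the failed allocations.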
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6078869Z 2025-05-07T20:32:35.6078993Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.6078998Z 2025-05-07T20:32:35.6079098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6079365Z self=, 2025-05-07T20:32:35.6079444Z T=1, 2025-05-07T20:32:35.6079521Z D=7168, 2025-05-07T20:32:35.6079603Z scale_ub=1200.0, 2025-05-07T20:32:35.6079688Z contiguous=True, 2025-05-07T20:32:35.6079811Z compiled=False, 2025-05-07T20:32:35.6079886Z ) 2025-05-07T20:32:35.6080111Z self = 2025-05-07T20:32:35.6080280Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6080285Z 2025-05-07T20:32:35.6080362Z @given( 2025-05-07T20:32:35.6080520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6080617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6080735Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6080854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6080967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6081041Z ) 2025-05-07T20:32:35.6081291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6081388Z def test_silu_mul_quant( 2025-05-07T20:32:35.6081463Z self, 2025-05-07T20:32:35.6081539Z T: int, 2025-05-07T20:32:35.6081621Z D: int, 2025-05-07T20:32:35.6081718Z scale_ub: Optional[float], 2025-05-07T20:32:35.6081810Z contiguous: bool, 2025-05-07T20:32:35.6081898Z compiled: bool, 2025-05-07T20:32:35.6081975Z ) -> None: 2025-05-07T20:32:35.6082069Z torch.manual_seed(2025) 2025-05-07T20:32:35.6082148Z 2025-05-07T20:32:35.6082316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6082389Z 2025-05-07T20:32:35.6082481Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6082605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6082699Z x = x_sign * x_clamp 2025-05-07T20:32:35.6082790Z x0 = x[:, :D] 2025-05-07T20:32:35.6082882Z x1 = x[:, D:] 2025-05-07T20:32:35.6082970Z 2025-05-07T20:32:35.6083067Z if contiguous: 2025-05-07T20:32:35.6083156Z x0 = x0.contiguous() 2025-05-07T20:32:35.6083245Z x1 = x1.contiguous() 2025-05-07T20:32:35.6083320Z 2025-05-07T20:32:35.6083411Z if scale_ub is not None: 2025-05-07T20:32:35.6083517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6083651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6083726Z ) 2025-05-07T20:32:35.6083804Z else: 2025-05-07T20:32:35.6083900Z scale_ub_tensor = None 2025-05-07T20:32:35.6083970Z 2025-05-07T20:32:35.6084101Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6084191Z op = silu_mul_quant 2025-05-07T20:32:35.6084325Z if compiled: 2025-05-07T20:32:35.6084425Z op = torch.compile(op) 2025-05-07T20:32:35.6084529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6084607Z 2025-05-07T20:32:35.6084697Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6084702Z 2025-05-07T20:32:35.6084797Z moe/activation_test.py:117: 2025-05-07T20:32:35.6084936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6085037Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6085134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6085652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6085750Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6086125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6086356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6086708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6086850Z kernel = self.compile( 2025-05-07T20:32:35.6087247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6087466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6087596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6087601Z 2025-05-07T20:32:35.6087807Z self = 2025-05-07T20:32:35.6088619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6089176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917552520>} 2025-05-07T20:32:35.6089958Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6090156Z context = 2025-05-07T20:32:35.6090161Z 2025-05-07T20:32:35.6090330Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6090606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6090714Z module_map=module_map) 2025-05-07T20:32:35.6090882Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6090981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6091058Z E ^ 2025-05-07T20:32:35.6091434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6091441Z 2025-05-07T20:32:35.6091951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6091956Z 2025-05-07T20:32:35.6092067Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6092296Z self=, 2025-05-07T20:32:35.6092377Z T=128, 2025-05-07T20:32:35.6092455Z D=5120, 2025-05-07T20:32:35.6092536Z scale_ub=None, 2025-05-07T20:32:35.6092619Z contiguous=True, 2025-05-07T20:32:35.6092704Z compiled=False, 2025-05-07T20:32:35.6092778Z ) 2025-05-07T20:32:35.6093004Z self = 2025-05-07T20:32:35.6093183Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6093258Z 2025-05-07T20:32:35.6093335Z @given( 2025-05-07T20:32:35.6093455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6093557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6093672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6093791Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6093908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6093980Z ) 2025-05-07T20:32:35.6094234Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6094327Z def test_silu_mul_quant( 2025-05-07T20:32:35.6094404Z self, 2025-05-07T20:32:35.6094483Z T: int, 2025-05-07T20:32:35.6094562Z D: int, 2025-05-07T20:32:35.6094658Z scale_ub: Optional[float], 2025-05-07T20:32:35.6094748Z contiguous: bool, 2025-05-07T20:32:35.6094832Z compiled: bool, 2025-05-07T20:32:35.6094917Z ) -> None: 2025-05-07T20:32:35.6095014Z torch.manual_seed(2025) 2025-05-07T20:32:35.6095086Z 2025-05-07T20:32:35.6095262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6095378Z 2025-05-07T20:32:35.6095472Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6095599Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6095685Z x = x_sign * x_clamp 2025-05-07T20:32:35.6095803Z x0 = x[:, :D] 2025-05-07T20:32:35.6095885Z x1 = x[:, D:] 2025-05-07T20:32:35.6095955Z 2025-05-07T20:32:35.6096036Z if contiguous: 2025-05-07T20:32:35.6096129Z x0 = x0.contiguous() 2025-05-07T20:32:35.6096217Z x1 = x1.contiguous() 2025-05-07T20:32:35.6096332Z 2025-05-07T20:32:35.6096420Z if scale_ub is not None: 2025-05-07T20:32:35.6096524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6096663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6096741Z ) 2025-05-07T20:32:35.6096817Z else: 2025-05-07T20:32:35.6096912Z scale_ub_tensor = None 2025-05-07T20:32:35.6096983Z 2025-05-07T20:32:35.6097113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6097206Z op = silu_mul_quant 2025-05-07T20:32:35.6097289Z if compiled: 2025-05-07T20:32:35.6097389Z op = torch.compile(op) 2025-05-07T20:32:35.6097500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6097570Z 2025-05-07T20:32:35.6097661Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6097665Z 2025-05-07T20:32:35.6097761Z moe/activation_test.py:117: 2025-05-07T20:32:35.6097893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6097996Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6098095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6098617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6098717Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6099091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6099324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6099685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6099778Z kernel = self.compile( 2025-05-07T20:32:35.6100179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6100357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6100489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6100496Z 2025-05-07T20:32:35.6100750Z self = 2025-05-07T20:32:35.6101581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6102110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917553420>} 2025-05-07T20:32:35.6102900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6103101Z context = 2025-05-07T20:32:35.6103106Z 2025-05-07T20:32:35.6103276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6103551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6103660Z module_map=module_map) 2025-05-07T20:32:35.6103863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6103963Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6104041Z E ^ 2025-05-07T20:32:35.6104412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6104454Z 2025-05-07T20:32:35.6104892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6104896Z 2025-05-07T20:32:35.6104998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6105270Z self=, 2025-05-07T20:32:35.6105352Z T=128, 2025-05-07T20:32:35.6105426Z D=7168, 2025-05-07T20:32:35.6105511Z scale_ub=None, 2025-05-07T20:32:35.6105598Z contiguous=True, 2025-05-07T20:32:35.6105681Z compiled=False, 2025-05-07T20:32:35.6105759Z ) 2025-05-07T20:32:35.6105987Z self = 2025-05-07T20:32:35.6106435Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6106443Z 2025-05-07T20:32:35.6106560Z @given( 2025-05-07T20:32:35.6106689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6106788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6106909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6110314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6110452Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6110537Z ) 2025-05-07T20:32:35.6110796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6110892Z def test_silu_mul_quant( 2025-05-07T20:32:35.6110970Z self, 2025-05-07T20:32:35.6111047Z T: int, 2025-05-07T20:32:35.6111128Z D: int, 2025-05-07T20:32:35.6111224Z scale_ub: Optional[float], 2025-05-07T20:32:35.6111317Z contiguous: bool, 2025-05-07T20:32:35.6111406Z compiled: bool, 2025-05-07T20:32:35.6111485Z ) -> None: 2025-05-07T20:32:35.6111582Z torch.manual_seed(2025) 2025-05-07T20:32:35.6111657Z 2025-05-07T20:32:35.6111835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6111913Z 2025-05-07T20:32:35.6112005Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6112130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6112220Z x = x_sign * x_clamp 2025-05-07T20:32:35.6112301Z x0 = x[:, :D] 2025-05-07T20:32:35.6112379Z x1 = x[:, D:] 2025-05-07T20:32:35.6112455Z 2025-05-07T20:32:35.6112537Z if contiguous: 2025-05-07T20:32:35.6112630Z x0 = x0.contiguous() 2025-05-07T20:32:35.6112851Z x1 = x1.contiguous() 2025-05-07T20:32:35.6112935Z 2025-05-07T20:32:35.6113040Z if scale_ub is not None: 2025-05-07T20:32:35.6113152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6113287Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6113364Z ) 2025-05-07T20:32:35.6113439Z else: 2025-05-07T20:32:35.6113535Z scale_ub_tensor = None 2025-05-07T20:32:35.6113608Z 2025-05-07T20:32:35.6113737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6113825Z op = silu_mul_quant 2025-05-07T20:32:35.6113911Z if compiled: 2025-05-07T20:32:35.6114009Z op = torch.compile(op) 2025-05-07T20:32:35.6114119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6114194Z 2025-05-07T20:32:35.6114283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6114288Z 2025-05-07T20:32:35.6114389Z moe/activation_test.py:117: 2025-05-07T20:32:35.6114524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6114623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6114789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6115310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6115461Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6115837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6116065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6116419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6116576Z kernel = self.compile( 2025-05-07T20:32:35.6116973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6117153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6117290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6117294Z 2025-05-07T20:32:35.6117501Z self = 2025-05-07T20:32:35.6118305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6118822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172984a0>} 2025-05-07T20:32:35.6119603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6119796Z context = 2025-05-07T20:32:35.6119803Z 2025-05-07T20:32:35.6119973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6120242Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6120352Z module_map=module_map) 2025-05-07T20:32:35.6120518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6120616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6120691Z E ^ 2025-05-07T20:32:35.6121057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6121064Z 2025-05-07T20:32:35.6121535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6121540Z 2025-05-07T20:32:35.6121648Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6121877Z self=, 2025-05-07T20:32:35.6121952Z T=2048, 2025-05-07T20:32:35.6122032Z D=7168, 2025-05-07T20:32:35.6122114Z scale_ub=1200.0, 2025-05-07T20:32:35.6122198Z contiguous=True, 2025-05-07T20:32:35.6122286Z compiled=False, 2025-05-07T20:32:35.6122360Z ) 2025-05-07T20:32:35.6122587Z self = 2025-05-07T20:32:35.6122780Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6122786Z 2025-05-07T20:32:35.6122871Z @given( 2025-05-07T20:32:35.6123019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6123117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6123231Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6123354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6123469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6123543Z ) 2025-05-07T20:32:35.6123839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6123934Z def test_silu_mul_quant( 2025-05-07T20:32:35.6124014Z self, 2025-05-07T20:32:35.6124091Z T: int, 2025-05-07T20:32:35.6124205Z D: int, 2025-05-07T20:32:35.6124305Z scale_ub: Optional[float], 2025-05-07T20:32:35.6124394Z contiguous: bool, 2025-05-07T20:32:35.6124479Z compiled: bool, 2025-05-07T20:32:35.6124563Z ) -> None: 2025-05-07T20:32:35.6124657Z torch.manual_seed(2025) 2025-05-07T20:32:35.6124797Z 2025-05-07T20:32:35.6124969Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6126830Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
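[NOTE] The allocator hint that ends the message above can be acted on at the job level. A minimal sketch, assuming the variable is set before CUDA is first initialized (for example at the very top of the test entry point); setting it after the first allocation has no effect:

    import os

    # Must be in place before the first CUDA allocation is made.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported only after the allocator config is set

Expandable segments only reduce fragmentation; they cannot help once the device is genuinely full, and with 26.44 MiB free out of 22.07 GiB this GPU is effectively exhausted by earlier examples.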
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6126839Z 2025-05-07T20:32:35.6126960Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6126965Z 2025-05-07T20:32:35.6127064Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6127296Z self=, 2025-05-07T20:32:35.6127377Z T=1, 2025-05-07T20:32:35.6127453Z D=5120, 2025-05-07T20:32:35.6127540Z scale_ub=1200.0, 2025-05-07T20:32:35.6127624Z contiguous=True, 2025-05-07T20:32:35.6127707Z compiled=False, 2025-05-07T20:32:35.6127785Z ) 2025-05-07T20:32:35.6128012Z self = 2025-05-07T20:32:35.6128183Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6128188Z 2025-05-07T20:32:35.6128269Z @given( 2025-05-07T20:32:35.6128387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6128491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6128605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6128721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6128840Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6128912Z ) 2025-05-07T20:32:35.6129161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6129260Z def test_silu_mul_quant( 2025-05-07T20:32:35.6129337Z self, 2025-05-07T20:32:35.6129413Z T: int, 2025-05-07T20:32:35.6129492Z D: int, 2025-05-07T20:32:35.6129636Z scale_ub: Optional[float], 2025-05-07T20:32:35.6129728Z contiguous: bool, 2025-05-07T20:32:35.6129818Z compiled: bool, 2025-05-07T20:32:35.6129897Z ) -> None: 2025-05-07T20:32:35.6129994Z torch.manual_seed(2025) 2025-05-07T20:32:35.6130068Z 2025-05-07T20:32:35.6130236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6130316Z 2025-05-07T20:32:35.6130408Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6130532Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6130623Z x = x_sign * x_clamp 2025-05-07T20:32:35.6130703Z x0 = x[:, :D] 2025-05-07T20:32:35.6130782Z x1 = x[:, D:] 2025-05-07T20:32:35.6130862Z 2025-05-07T20:32:35.6130947Z if contiguous: 2025-05-07T20:32:35.6131038Z x0 = x0.contiguous() 2025-05-07T20:32:35.6131131Z x1 = x1.contiguous() 2025-05-07T20:32:35.6131205Z 2025-05-07T20:32:35.6131301Z if scale_ub is not None: 2025-05-07T20:32:35.6131407Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6131543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6131665Z ) 2025-05-07T20:32:35.6131742Z else: 2025-05-07T20:32:35.6131921Z scale_ub_tensor = None 2025-05-07T20:32:35.6131999Z 2025-05-07T20:32:35.6132130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6132269Z op = silu_mul_quant 2025-05-07T20:32:35.6132358Z if compiled: 2025-05-07T20:32:35.6132456Z op = torch.compile(op) 2025-05-07T20:32:35.6132561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6132637Z 2025-05-07T20:32:35.6132769Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6132774Z 2025-05-07T20:32:35.6132873Z moe/activation_test.py:117: 2025-05-07T20:32:35.6133003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6133106Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6133211Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6133730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6133826Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6134197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6134427Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6134779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6134875Z kernel = self.compile( 2025-05-07T20:32:35.6135271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6135454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6135585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6135589Z 2025-05-07T20:32:35.6135797Z self = 2025-05-07T20:32:35.6136605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6137125Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917299a80>} 2025-05-07T20:32:35.6137902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6138139Z context = 2025-05-07T20:32:35.6138144Z 2025-05-07T20:32:35.6138317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6138590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6138696Z module_map=module_map) 2025-05-07T20:32:35.6138864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6138964Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6139039Z E ^ 2025-05-07T20:32:35.6139405Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6139410Z 2025-05-07T20:32:35.6139838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6139843Z 2025-05-07T20:32:35.6139948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6140179Z self=, 2025-05-07T20:32:35.6140257Z T=2048, 2025-05-07T20:32:35.6140337Z D=5120, 2025-05-07T20:32:35.6140458Z scale_ub=None, 2025-05-07T20:32:35.6140545Z contiguous=True, 2025-05-07T20:32:35.6140633Z compiled=False, 2025-05-07T20:32:35.6140704Z ) 2025-05-07T20:32:35.6140931Z self = 2025-05-07T20:32:35.6141148Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6141152Z 2025-05-07T20:32:35.6141226Z @given( 2025-05-07T20:32:35.6141350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6141447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6141600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6141721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6141832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6141911Z ) 2025-05-07T20:32:35.6142161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6142254Z def test_silu_mul_quant( 2025-05-07T20:32:35.6142332Z self, 2025-05-07T20:32:35.6142407Z T: int, 2025-05-07T20:32:35.6142482Z D: int, 2025-05-07T20:32:35.6142582Z scale_ub: Optional[float], 2025-05-07T20:32:35.6142674Z contiguous: bool, 2025-05-07T20:32:35.6142759Z compiled: bool, 2025-05-07T20:32:35.6142838Z ) -> None: 2025-05-07T20:32:35.6142931Z torch.manual_seed(2025) 2025-05-07T20:32:35.6143002Z 2025-05-07T20:32:35.6143173Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6143250Z 2025-05-07T20:32:35.6143365Z > x_sign = torch.sign(x) 2025-05-07T20:32:35.6145258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
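[NOTE] The recurring CompilationError above is an architecture limit, not a flaky build: Triton's fp8e4nv corresponds to float8_e4m3fn and requires compute capability 8.9 or newer, while this runner's ~22 GiB device is consistent with an Ampere-class GPU (SM 8.6) that only offers fp8e4b15 and fp8e5, exactly as the ValueError reports. A sketch of a guard that would skip rather than fail; the helper name supports_fp8e4nv is hypothetical and not part of FBGEMM or Triton:

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: fp8e4nv (float8_e4m3fn) needs SM 8.9+
        # (Ada/Hopper). An Ampere part such as an A10G reports (8, 6)
        # and lands in the ValueError path seen above.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

A test class could then apply unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+") and turn this failure into an explicit skip.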
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6145267Z 2025-05-07T20:32:35.6145389Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.6145394Z 2025-05-07T20:32:35.6145496Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6145725Z self=, 2025-05-07T20:32:35.6145802Z T=16384, 2025-05-07T20:32:35.6145880Z D=5120, 2025-05-07T20:32:35.6145961Z scale_ub=None, 2025-05-07T20:32:35.6146047Z contiguous=True, 2025-05-07T20:32:35.6146129Z compiled=False, 2025-05-07T20:32:35.6146200Z ) 2025-05-07T20:32:35.6146473Z self = 2025-05-07T20:32:35.6146654Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6146661Z 2025-05-07T20:32:35.6146741Z @given( 2025-05-07T20:32:35.6146858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6146957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6147076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6147190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6147302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6147377Z ) 2025-05-07T20:32:35.6147625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6147721Z def test_silu_mul_quant( 2025-05-07T20:32:35.6147800Z self, 2025-05-07T20:32:35.6147877Z T: int, 2025-05-07T20:32:35.6147952Z D: int, 2025-05-07T20:32:35.6148056Z scale_ub: Optional[float], 2025-05-07T20:32:35.6148145Z contiguous: bool, 2025-05-07T20:32:35.6148234Z compiled: bool, 2025-05-07T20:32:35.6148310Z ) -> None: 2025-05-07T20:32:35.6148451Z torch.manual_seed(2025) 2025-05-07T20:32:35.6148530Z 2025-05-07T20:32:35.6148697Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6150547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6150629Z 2025-05-07T20:32:35.6150750Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6150755Z 2025-05-07T20:32:35.6150856Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6151087Z self=, 2025-05-07T20:32:35.6151162Z T=4096, 2025-05-07T20:32:35.6151237Z D=5120, 2025-05-07T20:32:35.6151321Z scale_ub=None, 2025-05-07T20:32:35.6151407Z contiguous=True, 2025-05-07T20:32:35.6151493Z compiled=False, 2025-05-07T20:32:35.6151565Z ) 2025-05-07T20:32:35.6151787Z self = 2025-05-07T20:32:35.6151964Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6151968Z 2025-05-07T20:32:35.6152046Z @given( 2025-05-07T20:32:35.6152163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6152264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6152378Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6152495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6152611Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6152686Z ) 2025-05-07T20:32:35.6152940Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6153032Z def test_silu_mul_quant( 2025-05-07T20:32:35.6153110Z self, 2025-05-07T20:32:35.6153190Z T: int, 2025-05-07T20:32:35.6153267Z D: int, 2025-05-07T20:32:35.6153365Z scale_ub: Optional[float], 2025-05-07T20:32:35.6153456Z contiguous: bool, 2025-05-07T20:32:35.6153541Z compiled: bool, 2025-05-07T20:32:35.6153619Z ) -> None: 2025-05-07T20:32:35.6153715Z torch.manual_seed(2025) 2025-05-07T20:32:35.6153789Z 2025-05-07T20:32:35.6153956Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6155857Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6155865Z 2025-05-07T20:32:35.6155988Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6155992Z 2025-05-07T20:32:35.6156091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6156317Z self=, 2025-05-07T20:32:35.6156397Z T=2048, 2025-05-07T20:32:35.6156471Z D=5120, 2025-05-07T20:32:35.6156552Z scale_ub=None, 2025-05-07T20:32:35.6156640Z contiguous=False, 2025-05-07T20:32:35.6156725Z compiled=False, 2025-05-07T20:32:35.6156796Z ) 2025-05-07T20:32:35.6157022Z self = 2025-05-07T20:32:35.6157240Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.6157245Z 2025-05-07T20:32:35.6157325Z @given( 2025-05-07T20:32:35.6157443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6157600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6157716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6157832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6157944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6158021Z ) 2025-05-07T20:32:35.6158313Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6158407Z def test_silu_mul_quant( 2025-05-07T20:32:35.6158486Z self, 2025-05-07T20:32:35.6158562Z T: int, 2025-05-07T20:32:35.6158640Z D: int, 2025-05-07T20:32:35.6158741Z scale_ub: Optional[float], 2025-05-07T20:32:35.6158831Z contiguous: bool, 2025-05-07T20:32:35.6158924Z compiled: bool, 2025-05-07T20:32:35.6159001Z ) -> None: 2025-05-07T20:32:35.6159096Z torch.manual_seed(2025) 2025-05-07T20:32:35.6159173Z 2025-05-07T20:32:35.6159340Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6161314Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
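[NOTE] The OOM failures cascade: once roughly 21.7 GiB of allocations from earlier examples stays live, every subsequent example dies at its first torch.randn. A sketch of a cleanup helper, with the caveat that it is hypothetical and can only reclaim memory whose tensors are no longer referenced (allocations pinned by a captured torch.compile artifact stay put):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop unreachable tensors, then return cached allocator blocks
        # to the driver so the next example starts from a cleaner slate.
        gc.collect()
        torch.cuda.empty_cache()

It would have to be called from a try/finally inside the test body itself, because unittest-style tearDown does not run between individual Hypothesis examples; that is why one leaky example poisons all the examples after it.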
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6161330Z 2025-05-07T20:32:35.6161449Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6161454Z 2025-05-07T20:32:35.6161558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6161822Z self=, 2025-05-07T20:32:35.6161928Z T=4096, 2025-05-07T20:32:35.6162039Z D=7168, 2025-05-07T20:32:35.6162165Z scale_ub=None, 2025-05-07T20:32:35.6162256Z contiguous=True, 2025-05-07T20:32:35.6162345Z compiled=True, 2025-05-07T20:32:35.6162419Z ) 2025-05-07T20:32:35.6162644Z self = 2025-05-07T20:32:35.6162819Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.6162826Z 2025-05-07T20:32:35.6162901Z @given( 2025-05-07T20:32:35.6163018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6163123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6163296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6163414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6163541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6163632Z ) 2025-05-07T20:32:35.6163910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6164006Z def test_silu_mul_quant( 2025-05-07T20:32:35.6164083Z self, 2025-05-07T20:32:35.6164163Z T: int, 2025-05-07T20:32:35.6164239Z D: int, 2025-05-07T20:32:35.6164336Z scale_ub: Optional[float], 2025-05-07T20:32:35.6164429Z contiguous: bool, 2025-05-07T20:32:35.6164516Z compiled: bool, 2025-05-07T20:32:35.6164598Z ) -> None: 2025-05-07T20:32:35.6164695Z torch.manual_seed(2025) 2025-05-07T20:32:35.6164768Z 2025-05-07T20:32:35.6164934Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6166830Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6166872Z 2025-05-07T20:32:35.6166994Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6166999Z 2025-05-07T20:32:35.6167100Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6167368Z self=, 2025-05-07T20:32:35.6167453Z T=2048, 2025-05-07T20:32:35.6167529Z D=5120, 2025-05-07T20:32:35.6167610Z scale_ub=1200.0, 2025-05-07T20:32:35.6167703Z contiguous=False, 2025-05-07T20:32:35.6167786Z compiled=False, 2025-05-07T20:32:35.6167857Z ) 2025-05-07T20:32:35.6168083Z self = 2025-05-07T20:32:35.6168261Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.6168265Z 2025-05-07T20:32:35.6168345Z @given( 2025-05-07T20:32:35.6168464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6168560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6168676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6168792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6168904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6168981Z ) 2025-05-07T20:32:35.6169231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6169324Z def test_silu_mul_quant( 2025-05-07T20:32:35.6169404Z self, 2025-05-07T20:32:35.6169480Z T: int, 2025-05-07T20:32:35.6169556Z D: int, 2025-05-07T20:32:35.6169655Z scale_ub: Optional[float], 2025-05-07T20:32:35.6169745Z contiguous: bool, 2025-05-07T20:32:35.6169833Z compiled: bool, 2025-05-07T20:32:35.6169910Z ) -> None: 2025-05-07T20:32:35.6170002Z torch.manual_seed(2025) 2025-05-07T20:32:35.6170080Z 2025-05-07T20:32:35.6170247Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6172200Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6172214Z 2025-05-07T20:32:35.6172333Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6172338Z 2025-05-07T20:32:35.6172439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6172667Z self=, 2025-05-07T20:32:35.6172745Z T=4096, 2025-05-07T20:32:35.6172821Z D=7168, 2025-05-07T20:32:35.6172908Z scale_ub=1200.0, 2025-05-07T20:32:35.6172989Z contiguous=True, 2025-05-07T20:32:35.6173073Z compiled=False, 2025-05-07T20:32:35.6173151Z ) 2025-05-07T20:32:35.6173398Z self = 2025-05-07T20:32:35.6173602Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6173606Z 2025-05-07T20:32:35.6173682Z @given( 2025-05-07T20:32:35.6173801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6173901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6174016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6174175Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6174291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6174364Z ) 2025-05-07T20:32:35.6174612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6174755Z def test_silu_mul_quant( 2025-05-07T20:32:35.6174832Z self, 2025-05-07T20:32:35.6174910Z T: int, 2025-05-07T20:32:35.6174986Z D: int, 2025-05-07T20:32:35.6175080Z scale_ub: Optional[float], 2025-05-07T20:32:35.6175170Z contiguous: bool, 2025-05-07T20:32:35.6175294Z compiled: bool, 2025-05-07T20:32:35.6175371Z ) -> None: 2025-05-07T20:32:35.6175467Z torch.manual_seed(2025) 2025-05-07T20:32:35.6175538Z 2025-05-07T20:32:35.6175707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6177566Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6177574Z 2025-05-07T20:32:35.6177692Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6177699Z 2025-05-07T20:32:35.6177802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6178027Z self=, 2025-05-07T20:32:35.6178106Z T=16384, 2025-05-07T20:32:35.6178183Z D=7168, 2025-05-07T20:32:35.6178263Z scale_ub=None, 2025-05-07T20:32:35.6178349Z contiguous=False, 2025-05-07T20:32:35.6178430Z compiled=True, 2025-05-07T20:32:35.6178505Z ) 2025-05-07T20:32:35.6178729Z self = 2025-05-07T20:32:35.6178907Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.6178914Z 2025-05-07T20:32:35.6178990Z @given( 2025-05-07T20:32:35.6179109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6179204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6179318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6179432Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6179545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6179620Z ) 2025-05-07T20:32:35.6179870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6180007Z def test_silu_mul_quant( 2025-05-07T20:32:35.6180090Z self, 2025-05-07T20:32:35.6180168Z T: int, 2025-05-07T20:32:35.6180241Z D: int, 2025-05-07T20:32:35.6180343Z scale_ub: Optional[float], 2025-05-07T20:32:35.6180431Z contiguous: bool, 2025-05-07T20:32:35.6180515Z compiled: bool, 2025-05-07T20:32:35.6180596Z ) -> None: 2025-05-07T20:32:35.6180693Z torch.manual_seed(2025) 2025-05-07T20:32:35.6180769Z 2025-05-07T20:32:35.6180937Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6182785Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
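[NOTE] The failing allocation sizes match the input tensor exactly: x has shape [T, 2*D] in bfloat16, i.e. 2 bytes per element, so T=16384 with D=7168 needs 16384 * 14336 * 2 = 469,762,048 bytes = 448.00 MiB, precisely the figure reported above. A quick check against three of the reported failures:

    # bfloat16 input x of shape [T, 2*D]: 2 bytes per element.
    for T, D, reported_mib in [(16384, 7168, 448), (4096, 7168, 112), (2048, 5120, 40)]:
        mib = T * (2 * D) * 2 / 2**20
        assert mib == reported_mib, (T, D, mib)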
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6182839Z 2025-05-07T20:32:35.6182963Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6182967Z 2025-05-07T20:32:35.6183072Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6183368Z self=, 2025-05-07T20:32:35.6183444Z T=4096, 2025-05-07T20:32:35.6183519Z D=7168, 2025-05-07T20:32:35.6183603Z scale_ub=None, 2025-05-07T20:32:35.6183687Z contiguous=True, 2025-05-07T20:32:35.6183771Z compiled=False, 2025-05-07T20:32:35.6183845Z ) 2025-05-07T20:32:35.6184136Z self = 2025-05-07T20:32:35.6184327Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6184332Z 2025-05-07T20:32:35.6184411Z @given( 2025-05-07T20:32:35.6184533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6184636Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6184758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6184882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6185003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6185078Z ) 2025-05-07T20:32:35.6185368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6185462Z def test_silu_mul_quant( 2025-05-07T20:32:35.6185538Z self, 2025-05-07T20:32:35.6185619Z T: int, 2025-05-07T20:32:35.6185694Z D: int, 2025-05-07T20:32:35.6185796Z scale_ub: Optional[float], 2025-05-07T20:32:35.6185890Z contiguous: bool, 2025-05-07T20:32:35.6185976Z compiled: bool, 2025-05-07T20:32:35.6186054Z ) -> None: 2025-05-07T20:32:35.6186152Z torch.manual_seed(2025) 2025-05-07T20:32:35.6186226Z 2025-05-07T20:32:35.6186407Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6188689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6188699Z 2025-05-07T20:32:35.6188822Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6188827Z 2025-05-07T20:32:35.6188935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6189232Z self=, 2025-05-07T20:32:35.6189312Z T=16384, 2025-05-07T20:32:35.6189389Z D=7168, 2025-05-07T20:32:35.6189471Z scale_ub=None, 2025-05-07T20:32:35.6189561Z contiguous=True, 2025-05-07T20:32:35.6189645Z compiled=False, 2025-05-07T20:32:35.6189718Z ) 2025-05-07T20:32:35.6189970Z self = 2025-05-07T20:32:35.6190165Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6190170Z 2025-05-07T20:32:35.6190247Z @given( 2025-05-07T20:32:35.6190373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6190472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6190597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6190723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6190842Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6190920Z ) 2025-05-07T20:32:35.6191206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6191304Z def test_silu_mul_quant( 2025-05-07T20:32:35.6191383Z self, 2025-05-07T20:32:35.6191527Z T: int, 2025-05-07T20:32:35.6191603Z D: int, 2025-05-07T20:32:35.6191702Z scale_ub: Optional[float], 2025-05-07T20:32:35.6191789Z contiguous: bool, 2025-05-07T20:32:35.6191912Z compiled: bool, 2025-05-07T20:32:35.6191993Z ) -> None: 2025-05-07T20:32:35.6192087Z torch.manual_seed(2025) 2025-05-07T20:32:35.6192163Z 2025-05-07T20:32:35.6192330Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6194234Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6194357Z 2025-05-07T20:32:35.6194473Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6194478Z 2025-05-07T20:32:35.6194581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6194809Z self=, 2025-05-07T20:32:35.6194885Z T=16384, 2025-05-07T20:32:35.6194962Z D=7168, 2025-05-07T20:32:35.6195046Z scale_ub=1200.0, 2025-05-07T20:32:35.6195129Z contiguous=True, 2025-05-07T20:32:35.6195214Z compiled=False, 2025-05-07T20:32:35.6195289Z ) 2025-05-07T20:32:35.6195508Z self = 2025-05-07T20:32:35.6195693Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6195697Z 2025-05-07T20:32:35.6195772Z @given( 2025-05-07T20:32:35.6195890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6195990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6196104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6196220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6196336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6196408Z ) 2025-05-07T20:32:35.6196660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6196754Z def test_silu_mul_quant( 2025-05-07T20:32:35.6196830Z self, 2025-05-07T20:32:35.6196915Z T: int, 2025-05-07T20:32:35.6196989Z D: int, 2025-05-07T20:32:35.6197084Z scale_ub: Optional[float], 2025-05-07T20:32:35.6197176Z contiguous: bool, 2025-05-07T20:32:35.6197260Z compiled: bool, 2025-05-07T20:32:35.6197379Z ) -> None: 2025-05-07T20:32:35.6197475Z torch.manual_seed(2025) 2025-05-07T20:32:35.6197546Z 2025-05-07T20:32:35.6197715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6199566Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
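[NOTE] To pin down one of these cases without replaying the whole search, Hypothesis's example decorator can force a specific parameter combination to run first. A sketch, with the original strategies kept verbatim and max_examples replaced by a literal because _MAX_SAMPLES is defined elsewhere in the test module:

    from typing import Optional

    import hypothesis.strategies as st
    from hypothesis import Verbosity, example, given, settings

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    @settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # unchanged test body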
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6199578Z 2025-05-07T20:32:35.6199692Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6199696Z 2025-05-07T20:32:35.6199799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6200028Z self=, 2025-05-07T20:32:35.6200108Z T=128, 2025-05-07T20:32:35.6200187Z D=5120, 2025-05-07T20:32:35.6200311Z scale_ub=1200.0, 2025-05-07T20:32:35.6200400Z contiguous=False, 2025-05-07T20:32:35.6200483Z compiled=False, 2025-05-07T20:32:35.6200555Z ) 2025-05-07T20:32:35.6200777Z self = 2025-05-07T20:32:35.6200989Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.6200994Z 2025-05-07T20:32:35.6201070Z @given( 2025-05-07T20:32:35.6201189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6201287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6201443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6201557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6201673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6201748Z ) 2025-05-07T20:32:35.6201997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6202091Z def test_silu_mul_quant( 2025-05-07T20:32:35.6202170Z self, 2025-05-07T20:32:35.6202247Z T: int, 2025-05-07T20:32:35.6202323Z D: int, 2025-05-07T20:32:35.6202423Z scale_ub: Optional[float], 2025-05-07T20:32:35.6202514Z contiguous: bool, 2025-05-07T20:32:35.6202597Z compiled: bool, 2025-05-07T20:32:35.6202677Z ) -> None: 2025-05-07T20:32:35.6202769Z torch.manual_seed(2025) 2025-05-07T20:32:35.6202842Z 2025-05-07T20:32:35.6203009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6203085Z 2025-05-07T20:32:35.6203179Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6203303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6203390Z x = x_sign * x_clamp 2025-05-07T20:32:35.6203473Z x0 = x[:, :D] 2025-05-07T20:32:35.6203552Z x1 = x[:, D:] 2025-05-07T20:32:35.6203622Z 2025-05-07T20:32:35.6203710Z if contiguous: 2025-05-07T20:32:35.6203802Z x0 = x0.contiguous() 2025-05-07T20:32:35.6203892Z x1 = x1.contiguous() 2025-05-07T20:32:35.6203965Z 2025-05-07T20:32:35.6204055Z if scale_ub is not None: 2025-05-07T20:32:35.6204169Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6204304Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6204378Z ) 2025-05-07T20:32:35.6204457Z else: 2025-05-07T20:32:35.6204552Z scale_ub_tensor = None 2025-05-07T20:32:35.6204624Z 2025-05-07T20:32:35.6204763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6204852Z op = silu_mul_quant 2025-05-07T20:32:35.6204936Z if compiled: 2025-05-07T20:32:35.6205038Z op = torch.compile(op) 2025-05-07T20:32:35.6205190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6205263Z 2025-05-07T20:32:35.6205355Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6205359Z 2025-05-07T20:32:35.6205459Z moe/activation_test.py:117: 2025-05-07T20:32:35.6205593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6205693Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6205795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6206566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6206668Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6207041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6207277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6207630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6207727Z kernel = self.compile( 2025-05-07T20:32:35.6208216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6208396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6208528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6208590Z 2025-05-07T20:32:35.6208798Z self = 2025-05-07T20:32:35.6209612Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6210199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e07c0>} 2025-05-07T20:32:35.6210975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6211171Z context = 2025-05-07T20:32:35.6211178Z 2025-05-07T20:32:35.6211346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6211622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6211728Z module_map=module_map) 2025-05-07T20:32:35.6211958Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6212062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6212139Z E ^ 2025-05-07T20:32:35.6212511Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6212516Z 2025-05-07T20:32:35.6212996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6213001Z 2025-05-07T20:32:35.6213101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6213333Z self=, 2025-05-07T20:32:35.6213411Z T=2048, 2025-05-07T20:32:35.6213487Z D=7168, 2025-05-07T20:32:35.6213572Z scale_ub=None, 2025-05-07T20:32:35.6213658Z contiguous=False, 2025-05-07T20:32:35.6213745Z compiled=False, 2025-05-07T20:32:35.6213817Z ) 2025-05-07T20:32:35.6214042Z self = 2025-05-07T20:32:35.6214224Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.6214229Z 2025-05-07T20:32:35.6214304Z @given( 2025-05-07T20:32:35.6214491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6214593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6214709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6214823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6214939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6215010Z ) 2025-05-07T20:32:35.6215270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6215363Z def test_silu_mul_quant( 2025-05-07T20:32:35.6215438Z self, 2025-05-07T20:32:35.6215517Z T: int, 2025-05-07T20:32:35.6215592Z D: int, 2025-05-07T20:32:35.6215689Z scale_ub: Optional[float], 2025-05-07T20:32:35.6215783Z contiguous: bool, 2025-05-07T20:32:35.6215868Z compiled: bool, 2025-05-07T20:32:35.6215947Z ) -> None: 2025-05-07T20:32:35.6216042Z torch.manual_seed(2025) 2025-05-07T20:32:35.6216114Z 2025-05-07T20:32:35.6216283Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6218193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
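[NOTE] The retry's fresh pytest session further below prints hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,), so the rerun replays examples deterministically rather than sampling fresh ones. A sketch of how such a profile is typically registered (the conftest.py placement is an assumption; the settings themselves are taken from the log):

    # conftest.py (placement assumed)
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=[HealthCheck.too_slow],
    )
    settings.load_profile("ci")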
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6218237Z 2025-05-07T20:32:35.6218355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6218402Z 2025-05-07T20:32:35.6218505Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6218732Z self=, 2025-05-07T20:32:35.6218815Z T=128, 2025-05-07T20:32:35.6218890Z D=7168, 2025-05-07T20:32:35.6218973Z scale_ub=1200.0, 2025-05-07T20:32:35.6219062Z contiguous=True, 2025-05-07T20:32:35.6219147Z compiled=True, 2025-05-07T20:32:35.6219218Z ) 2025-05-07T20:32:35.6219444Z self = 2025-05-07T20:32:35.6219615Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.6219622Z 2025-05-07T20:32:35.6219696Z @given( 2025-05-07T20:32:35.6219819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6219916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6220032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6220151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6220263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6220340Z ) 2025-05-07T20:32:35.6220593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6220684Z def test_silu_mul_quant( 2025-05-07T20:32:35.6220761Z self, 2025-05-07T20:32:35.6220839Z T: int, 2025-05-07T20:32:35.6220915Z D: int, 2025-05-07T20:32:35.6221016Z scale_ub: Optional[float], 2025-05-07T20:32:35.6221104Z contiguous: bool, 2025-05-07T20:32:35.6221189Z compiled: bool, 2025-05-07T20:32:35.6221268Z ) -> None: 2025-05-07T20:32:35.6221360Z torch.manual_seed(2025) 2025-05-07T20:32:35.6221433Z 2025-05-07T20:32:35.6221601Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6221674Z 2025-05-07T20:32:35.6221768Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6221890Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6221981Z x = x_sign * x_clamp 2025-05-07T20:32:35.6222063Z x0 = x[:, :D] 2025-05-07T20:32:35.6222142Z x1 = x[:, D:] 2025-05-07T20:32:35.6222214Z 2025-05-07T20:32:35.6222370Z if contiguous: 2025-05-07T20:32:35.6222462Z x0 = x0.contiguous() 2025-05-07T20:32:35.6222549Z x1 = x1.contiguous() 2025-05-07T20:32:35.6222625Z 2025-05-07T20:32:35.6222735Z if scale_ub is not None: 2025-05-07T20:32:35.6222853Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6223007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6223084Z ) 2025-05-07T20:32:35.6223166Z else: 2025-05-07T20:32:35.6223262Z scale_ub_tensor = None 2025-05-07T20:32:35.6223333Z 2025-05-07T20:32:35.6223466Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6223555Z op = silu_mul_quant 2025-05-07T20:32:35.6223642Z if compiled: 2025-05-07T20:32:35.6223744Z op = torch.compile(op) 2025-05-07T20:32:35.6223848Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6223919Z 2025-05-07T20:32:35.6224014Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6224018Z 2025-05-07T20:32:35.6224112Z moe/activation_test.py:117: 2025-05-07T20:32:35.6224291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6224393Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6224492Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6224877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.6225009Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.6225519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6225620Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6226026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6226261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6226611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6226707Z kernel = self.compile( 2025-05-07T20:32:35.6227104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6227280Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6227416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6227420Z 2025-05-07T20:32:35.6227628Z self = 2025-05-07T20:32:35.6228431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6228955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e1940>} 2025-05-07T20:32:35.6229732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6229928Z context = 2025-05-07T20:32:35.6229933Z 2025-05-07T20:32:35.6230101Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6230370Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6230480Z module_map=module_map) 2025-05-07T20:32:35.6230642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6230742Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6230817Z E ^ 2025-05-07T20:32:35.6231224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6231229Z 2025-05-07T20:32:35.6231663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6231667Z 2025-05-07T20:32:35.6231769Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6232000Z self=, 2025-05-07T20:32:35.6232076Z T=128, 2025-05-07T20:32:35.6232150Z D=7168, 2025-05-07T20:32:35.6232236Z scale_ub=1200.0, 2025-05-07T20:32:35.6232319Z contiguous=True, 2025-05-07T20:32:35.6232402Z compiled=False, 2025-05-07T20:32:35.6232478Z ) 2025-05-07T20:32:35.6232701Z self = 2025-05-07T20:32:35.6232876Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6232880Z 2025-05-07T20:32:35.6232960Z @given( 2025-05-07T20:32:35.6233077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6233224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6233363Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6233491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6233619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6233733Z ) 2025-05-07T20:32:35.6233984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6234080Z def test_silu_mul_quant( 2025-05-07T20:32:35.6234158Z self, 2025-05-07T20:32:35.6234234Z T: int, 2025-05-07T20:32:35.6234313Z D: int, 2025-05-07T20:32:35.6234454Z scale_ub: Optional[float], 2025-05-07T20:32:35.6234542Z contiguous: bool, 2025-05-07T20:32:35.6234632Z compiled: bool, 2025-05-07T20:32:35.6234709Z ) -> None: 2025-05-07T20:32:35.6234811Z torch.manual_seed(2025) 2025-05-07T20:32:35.6234884Z 2025-05-07T20:32:35.6235053Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6235137Z 2025-05-07T20:32:35.6235229Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6235357Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6240495Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6240512Z 2025-05-07T20:32:35.6240653Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.6240658Z 2025-05-07T20:32:35.6240763Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6240996Z self=, 2025-05-07T20:32:35.6241072Z T=128, 2025-05-07T20:32:35.6241151Z D=5120, 2025-05-07T20:32:35.6241234Z scale_ub=1200.0, 2025-05-07T20:32:35.6241320Z contiguous=True, 2025-05-07T20:32:35.6241405Z compiled=True, 2025-05-07T20:32:35.6241477Z ) 2025-05-07T20:32:35.6241704Z self = 2025-05-07T20:32:35.6241875Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.6241880Z 2025-05-07T20:32:35.6241957Z @given( 2025-05-07T20:32:35.6242083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6242186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6242303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6242483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6242596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6242671Z ) 2025-05-07T20:32:35.6242926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6243018Z def test_silu_mul_quant( 2025-05-07T20:32:35.6243097Z self, 2025-05-07T20:32:35.6243176Z T: int, 2025-05-07T20:32:35.6243253Z D: int, 2025-05-07T20:32:35.6243356Z scale_ub: Optional[float], 2025-05-07T20:32:35.6243466Z contiguous: bool, 2025-05-07T20:32:35.6243562Z compiled: bool, 2025-05-07T20:32:35.6243662Z ) -> None: 2025-05-07T20:32:35.6243756Z torch.manual_seed(2025) 2025-05-07T20:32:35.6243832Z 2025-05-07T20:32:35.6244005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6244077Z 2025-05-07T20:32:35.6244169Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6244297Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6246186Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6246232Z 2025-05-07T20:32:35.6246351Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.6246393Z 2025-05-07T20:32:35.6246496Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6246726Z self=, 2025-05-07T20:32:35.6246801Z T=128, 2025-05-07T20:32:35.6246879Z D=7168, 2025-05-07T20:32:35.6246962Z scale_ub=None, 2025-05-07T20:32:35.6247046Z contiguous=True, 2025-05-07T20:32:35.6247127Z compiled=True, 2025-05-07T20:32:35.6247201Z ) 2025-05-07T20:32:35.6247426Z self = 2025-05-07T20:32:35.6247597Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.6247604Z 2025-05-07T20:32:35.6247680Z @given( 2025-05-07T20:32:35.6247797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6247895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6248007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6248122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6248241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6248313Z ) 2025-05-07T20:32:35.6248566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6248659Z def test_silu_mul_quant( 2025-05-07T20:32:35.6248734Z self, 2025-05-07T20:32:35.6248813Z T: int, 2025-05-07T20:32:35.6248888Z D: int, 2025-05-07T20:32:35.6248987Z scale_ub: Optional[float], 2025-05-07T20:32:35.6249076Z contiguous: bool, 2025-05-07T20:32:35.6249161Z compiled: bool, 2025-05-07T20:32:35.6249239Z ) -> None: 2025-05-07T20:32:35.6249338Z torch.manual_seed(2025) 2025-05-07T20:32:35.6249410Z 2025-05-07T20:32:35.6249577Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6251459Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6251469Z 2025-05-07T20:32:35.6251588Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6251725Z =============================== warnings summary =============================== 2025-05-07T20:32:35.6252113Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.6252431Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.6252741Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.6253704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:35.6253942Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:35.6253991Z 2025-05-07T20:32:35.6254210Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:35.6254383Z ================= 1 failed, 1 deselected, 3 warnings in 13.14s ================= 2025-05-07T20:32:37.1908322Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:37.2536059Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:37.2536281Z 2025-05-07T20:32:39.2555683Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:41.4013759Z ============================= test session starts ============================== 2025-05-07T20:32:41.4014959Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:41.4015539Z cachedir: .pytest_cache 2025-05-07T20:32:41.4016129Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:41.4016875Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:41.4017292Z plugins: hypothesis-6.131.14 2025-05-07T20:32:43.0459230Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:43.1539889Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:43.1540391Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:43.1540610Z 2025-05-07T20:32:45.5280644Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5281367Z self=, 2025-05-07T20:32:45.5281796Z T=1, 2025-05-07T20:32:45.5281984Z D=5120, 2025-05-07T20:32:45.5282184Z scale_ub=None, 2025-05-07T20:32:45.5282393Z contiguous=True, 2025-05-07T20:32:45.5282619Z compiled=True, 2025-05-07T20:32:45.5282828Z ) 2025-05-07T20:32:45.5283152Z self = 2025-05-07T20:32:45.5283657Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.5283926Z 2025-05-07T20:32:45.5284011Z @given( 2025-05-07T20:32:45.5284244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.5284571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.5284891Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.5285225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.5285852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.5286153Z ) 2025-05-07T20:32:45.5286515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.5286967Z def test_silu_mul_quant( 2025-05-07T20:32:45.5287214Z self, 2025-05-07T20:32:45.5287413Z T: int, 2025-05-07T20:32:45.5287610Z D: int, 2025-05-07T20:32:45.5287836Z scale_ub: Optional[float], 2025-05-07T20:32:45.5288117Z contiguous: bool, 2025-05-07T20:32:45.5288357Z compiled: bool, 2025-05-07T20:32:45.5288593Z ) -> None: 2025-05-07T20:32:45.5288818Z torch.manual_seed(2025) 2025-05-07T20:32:45.5289062Z 2025-05-07T20:32:45.5289340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.5289698Z 2025-05-07T20:32:45.5289889Z x_sign = torch.sign(x) 2025-05-07T20:32:45.5290184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:45.5290503Z x = x_sign * x_clamp 2025-05-07T20:32:45.5290745Z x0 = x[:, :D] 2025-05-07T20:32:45.5290966Z x1 = x[:, D:] 2025-05-07T20:32:45.5291178Z 2025-05-07T20:32:45.5291362Z if contiguous: 2025-05-07T20:32:45.5291687Z x0 = x0.contiguous() 2025-05-07T20:32:45.5292025Z x1 = x1.contiguous() 2025-05-07T20:32:45.5292269Z 2025-05-07T20:32:45.5292458Z if scale_ub is not None: 2025-05-07T20:32:45.5292823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.5293167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.5293479Z ) 2025-05-07T20:32:45.5293681Z else: 2025-05-07T20:32:45.5293900Z scale_ub_tensor = None 2025-05-07T20:32:45.5294151Z 2025-05-07T20:32:45.5294390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5294801Z op = silu_mul_quant 2025-05-07T20:32:45.5295052Z if compiled: 2025-05-07T20:32:45.5295306Z op = torch.compile(op) 2025-05-07T20:32:45.5295611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.5295913Z 2025-05-07T20:32:45.5296135Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.5296429Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.5296728Z 2025-05-07T20:32:45.5296966Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5297310Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.5297613Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.5297929Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.5298299Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5298620Z 2025-05-07T20:32:45.5298819Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.5299027Z 2025-05-07T20:32:45.5299131Z moe/activation_test.py:126: 2025-05-07T20:32:45.5299439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5299780Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.5300118Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5300943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.5301724Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.5302287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.5303000Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.5303713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.5304461Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.5305208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.5305923Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.5306835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.5307362Z fn() 2025-05-07T20:32:45.5307890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.5308489Z self.fn.run( 2025-05-07T20:32:45.5308966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.5309502Z kernel = self.compile( 2025-05-07T20:32:45.5310053Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.5310728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.5311130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5311373Z 2025-05-07T20:32:45.5311585Z self = 2025-05-07T20:32:45.5312779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.5314278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33e99c60>} 2025-05-07T20:32:45.5315671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.5316804Z context = 2025-05-07T20:32:45.5317105Z 2025-05-07T20:32:45.5317277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.5317815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.5318298Z module_map=module_map) 2025-05-07T20:32:45.5318665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.5319029Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.5319303Z E ^ 2025-05-07T20:32:45.5319774Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5320243Z 2025-05-07T20:32:45.5320673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.5321212Z 2025-05-07T20:32:45.5321317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5321745Z self=, 2025-05-07T20:32:45.5322156Z T=2048, 2025-05-07T20:32:45.5322352Z D=5120, 2025-05-07T20:32:45.5322550Z scale_ub=1200.0, 2025-05-07T20:32:45.5322770Z contiguous=True, 2025-05-07T20:32:45.5323000Z compiled=False, 2025-05-07T20:32:45.5323216Z ) 2025-05-07T20:32:46.2646323Z self = 2025-05-07T20:32:46.2647189Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.2647530Z 2025-05-07T20:32:46.2647621Z @given( 2025-05-07T20:32:46.2647855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.2648178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.2648492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.2648825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.2649176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.2649469Z ) 2025-05-07T20:32:46.2650160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.2650629Z def test_silu_mul_quant( 2025-05-07T20:32:46.2650884Z self, 2025-05-07T20:32:46.2651084Z T: int, 2025-05-07T20:32:46.2651300Z D: int, 2025-05-07T20:32:46.2651529Z scale_ub: Optional[float], 2025-05-07T20:32:46.2651904Z contiguous: bool, 2025-05-07T20:32:46.2652147Z compiled: bool, 2025-05-07T20:32:46.2652385Z ) -> None: 2025-05-07T20:32:46.2652615Z torch.manual_seed(2025) 2025-05-07T20:32:46.2652862Z 2025-05-07T20:32:46.2653143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.2653506Z 2025-05-07T20:32:46.2653702Z x_sign = torch.sign(x) 2025-05-07T20:32:46.2654004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.2654332Z x = x_sign * x_clamp 2025-05-07T20:32:46.2654575Z x0 = x[:, :D] 
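# (annotation) x is generated as a [T, 2*D] bfloat16 tensor, sign/clamped so
# |x| lies in [0.01, 2.0], then split into two D-wide halves: x0 drives the
# SiLU gate and x1 is the multiplicand. With contiguous=False the halves stay
# strided views of x, so the kernels are also exercised on non-contiguous
# inputs.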
2025-05-07T20:32:46.2654799Z x1 = x[:, D:] 2025-05-07T20:32:46.2655014Z 2025-05-07T20:32:46.2655208Z if contiguous: 2025-05-07T20:32:46.2655447Z x0 = x0.contiguous() 2025-05-07T20:32:46.2655712Z x1 = x1.contiguous() 2025-05-07T20:32:46.2655953Z 2025-05-07T20:32:46.2656243Z if scale_ub is not None: 2025-05-07T20:32:46.2656525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.2656865Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.2657268Z ) 2025-05-07T20:32:46.2657467Z else: 2025-05-07T20:32:46.2657686Z scale_ub_tensor = None 2025-05-07T20:32:46.2657938Z 2025-05-07T20:32:46.2658176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.2658502Z op = silu_mul_quant 2025-05-07T20:32:46.2658755Z if compiled: 2025-05-07T20:32:46.2659119Z op = torch.compile(op) 2025-05-07T20:32:46.2659425Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.2659703Z 2025-05-07T20:32:46.2659904Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.2660073Z 2025-05-07T20:32:46.2660182Z moe/activation_test.py:117: 2025-05-07T20:32:46.2660484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2660831Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.2661122Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.2661843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.2662559Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.2663117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.2663828Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.2664519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.2665079Z kernel = self.compile( 2025-05-07T20:32:46.2665645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.2666373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.2666798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2667046Z 2025-05-07T20:32:46.2667259Z self = 2025-05-07T20:32:46.2668393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.2669848Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33cf0220>} 2025-05-07T20:32:46.2671299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.2672374Z context = 2025-05-07T20:32:46.2672685Z 2025-05-07T20:32:46.2672856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.2673406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.2673888Z module_map=module_map) 2025-05-07T20:32:46.2674267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.2674635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.2674909Z E ^ 2025-05-07T20:32:46.2675386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.2675862Z 2025-05-07T20:32:46.2676302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.2676889Z 2025-05-07T20:32:46.2677000Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.2677475Z self=, 2025-05-07T20:32:46.2677895Z T=2048, 2025-05-07T20:32:46.2678088Z D=5120, 2025-05-07T20:32:46.2678288Z scale_ub=1200.0, 2025-05-07T20:32:46.2678559Z contiguous=True, 2025-05-07T20:32:46.2678791Z compiled=True, 2025-05-07T20:32:46.2679005Z ) 2025-05-07T20:32:46.2679328Z self = 2025-05-07T20:32:46.2679843Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.2680171Z 2025-05-07T20:32:46.2680256Z @given( 2025-05-07T20:32:46.2680487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.2680809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.2681131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.2681465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.2681806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.2682104Z ) 2025-05-07T20:32:46.2682463Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.2682915Z def test_silu_mul_quant( 2025-05-07T20:32:46.2683168Z self, 2025-05-07T20:32:46.2683373Z T: int, 2025-05-07T20:32:46.2683574Z D: int, 2025-05-07T20:32:46.2683803Z scale_ub: Optional[float], 2025-05-07T20:32:46.2684087Z contiguous: bool, 2025-05-07T20:32:46.2684332Z compiled: bool, 2025-05-07T20:32:46.2684562Z ) -> None: 2025-05-07T20:32:46.2684789Z torch.manual_seed(2025) 2025-05-07T20:32:46.2685037Z 2025-05-07T20:32:46.2685319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.2685674Z 2025-05-07T20:32:46.2685871Z x_sign = torch.sign(x) 2025-05-07T20:32:46.2686200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.2686548Z x = x_sign * x_clamp 2025-05-07T20:32:46.2686800Z x0 = x[:, :D] 2025-05-07T20:32:46.2687019Z x1 = x[:, D:] 2025-05-07T20:32:46.2687233Z 2025-05-07T20:32:46.2687426Z if contiguous: 2025-05-07T20:32:46.2687658Z x0 = x0.contiguous() 2025-05-07T20:32:46.2687924Z x1 = x1.contiguous() 2025-05-07T20:32:46.2688173Z 2025-05-07T20:32:46.2688365Z if scale_ub is not None: 2025-05-07T20:32:46.2688642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.2688985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.2689300Z ) 2025-05-07T20:32:46.2689497Z else: 2025-05-07T20:32:46.2689712Z scale_ub_tensor = None 2025-05-07T20:32:46.2689963Z 2025-05-07T20:32:46.2690200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.2690576Z op = silu_mul_quant 2025-05-07T20:32:46.2690831Z if compiled: 2025-05-07T20:32:46.2691084Z op = torch.compile(op) 2025-05-07T20:32:46.2691389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.2691666Z 2025-05-07T20:32:46.2691917Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.2692213Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.2692515Z 2025-05-07T20:32:46.2692753Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.2693097Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.2693408Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.2693731Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.2694103Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.2694426Z 2025-05-07T20:32:46.2694626Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.2694832Z 2025-05-07T20:32:46.2694940Z moe/activation_test.py:126: 2025-05-07T20:32:46.2695246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2695638Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.2695970Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.2696790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.2697623Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.2698183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.2698900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.2699658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.2700413Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.2701174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.2701845Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.2702476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.2703022Z fn() 2025-05-07T20:32:46.2703543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.2704149Z self.fn.run( 2025-05-07T20:32:46.2704637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.2705188Z kernel = self.compile( 2025-05-07T20:32:46.2705750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.2706729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.2707148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2707390Z 2025-05-07T20:32:46.2707604Z self = 2025-05-07T20:32:46.2708734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.2710179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33cf16c0>} 2025-05-07T20:32:46.2711692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.2712764Z context = 2025-05-07T20:32:46.2713064Z 2025-05-07T20:32:46.2713240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.2713788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.2714271Z module_map=module_map) 2025-05-07T20:32:46.2714641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.2715009Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.2715284Z E ^ 2025-05-07T20:32:46.2715767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.2716237Z 2025-05-07T20:32:46.2716668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.2717204Z 2025-05-07T20:32:46.2717312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.2717738Z self=, 2025-05-07T20:32:46.2718157Z T=16384, 2025-05-07T20:32:46.2718428Z D=7168, 2025-05-07T20:32:46.2718633Z scale_ub=1200.0, 2025-05-07T20:32:46.2718864Z contiguous=False, 2025-05-07T20:32:46.2719088Z compiled=False, 2025-05-07T20:32:46.2719300Z ) 2025-05-07T20:32:46.9953080Z self = 2025-05-07T20:32:46.9953867Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:46.9960966Z 2025-05-07T20:32:46.9961116Z @given( 2025-05-07T20:32:46.9961374Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.9962002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.9962319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.9962657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.9962990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.9963279Z ) 2025-05-07T20:32:46.9963644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.9964098Z def test_silu_mul_quant( 2025-05-07T20:32:46.9964345Z self, 2025-05-07T20:32:46.9964551Z T: int, 2025-05-07T20:32:46.9964748Z D: int, 2025-05-07T20:32:46.9964978Z scale_ub: Optional[float], 2025-05-07T20:32:46.9965256Z contiguous: bool, 2025-05-07T20:32:46.9965495Z compiled: bool, 2025-05-07T20:32:46.9965736Z ) -> None: 2025-05-07T20:32:46.9965958Z torch.manual_seed(2025) 2025-05-07T20:32:46.9966200Z 2025-05-07T20:32:46.9966482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.9966888Z 2025-05-07T20:32:46.9967084Z x_sign = torch.sign(x) 2025-05-07T20:32:46.9967382Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.9967703Z x = x_sign * x_clamp 2025-05-07T20:32:46.9967950Z x0 = x[:, :D] 2025-05-07T20:32:46.9968166Z x1 = x[:, D:] 2025-05-07T20:32:46.9968382Z 2025-05-07T20:32:46.9968576Z if contiguous: 2025-05-07T20:32:46.9968845Z x0 = x0.contiguous() 2025-05-07T20:32:46.9969101Z x1 = x1.contiguous() 2025-05-07T20:32:46.9969343Z 2025-05-07T20:32:46.9969540Z if scale_ub is not None: 2025-05-07T20:32:46.9969815Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.9970163Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.9970479Z ) 2025-05-07T20:32:46.9970675Z else: 2025-05-07T20:32:46.9970885Z scale_ub_tensor = None 2025-05-07T20:32:46.9971142Z 2025-05-07T20:32:46.9971378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.9971694Z op = silu_mul_quant 2025-05-07T20:32:46.9972029Z if compiled: 2025-05-07T20:32:46.9972374Z op = torch.compile(op) 2025-05-07T20:32:46.9972675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.9972957Z 2025-05-07T20:32:46.9973152Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.9973321Z 2025-05-07T20:32:46.9973423Z moe/activation_test.py:117: 2025-05-07T20:32:46.9973727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.9974077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.9974367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.9975258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
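# (annotation) Every retried example fails the same way: both the fused
# _fbgemm_silu_mul_quant kernel and the _kernel_quantize_fp8_row reference
# path abort in Triton's make_ir because fp8e4nv (e4m3) is not available on
# this GPU; the error lists only 'fp8e4b15' and 'fp8e5'. A guard of roughly
# this shape could skip such examples on unsupported hardware (hypothetical
# snippet; the SM 8.9 cutoff for e4m3 is an assumption about this Triton
# build, not taken from the log):
#
#   major, minor = torch.cuda.get_device_capability()
#   if (major, minor) < (8, 9):
#       self.skipTest("fp8e4nv (e4m3) unsupported on this GPU")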
2025-05-07T20:32:46.9975979Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.9976540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.9977237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.9977929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.9978486Z kernel = self.compile( 2025-05-07T20:32:46.9979132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.9979807Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.9980291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.9980528Z 2025-05-07T20:32:46.9980746Z self = 2025-05-07T20:32:46.9981869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.9983340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32b9cea0>} 2025-05-07T20:32:46.9984737Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.9985798Z context = 2025-05-07T20:32:46.9986097Z 2025-05-07T20:32:46.9986276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.9986810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.9987292Z module_map=module_map) 2025-05-07T20:32:46.9987669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.9988035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.9988298Z E ^ 2025-05-07T20:32:46.9988779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.9989247Z 2025-05-07T20:32:46.9989687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.9990217Z 2025-05-07T20:32:46.9990327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.9990752Z self=, 2025-05-07T20:32:46.9991168Z T=1, 2025-05-07T20:32:46.9991355Z D=7168, 2025-05-07T20:32:46.9991547Z scale_ub=None, 2025-05-07T20:32:46.9991766Z contiguous=True, 2025-05-07T20:32:46.9991994Z compiled=True, 2025-05-07T20:32:46.9992197Z ) 2025-05-07T20:32:46.9992529Z self = 2025-05-07T20:32:46.9993028Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.9993293Z 2025-05-07T20:32:46.9993371Z @given( 2025-05-07T20:32:46.9993649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.9993975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.9994293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.9994630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.9994969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.9995264Z ) 2025-05-07T20:32:46.9995621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.9996079Z def test_silu_mul_quant( 2025-05-07T20:32:46.9996327Z self, 2025-05-07T20:32:46.9996524Z T: int, 2025-05-07T20:32:46.9996760Z D: int, 2025-05-07T20:32:46.9997005Z scale_ub: Optional[float], 2025-05-07T20:32:46.9997279Z contiguous: bool, 2025-05-07T20:32:46.9997528Z compiled: bool, 2025-05-07T20:32:46.9997760Z ) -> None: 2025-05-07T20:32:46.9997975Z torch.manual_seed(2025) 2025-05-07T20:32:46.9998228Z 2025-05-07T20:32:46.9998507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.9998850Z 2025-05-07T20:32:46.9999089Z x_sign = torch.sign(x) 2025-05-07T20:32:46.9999390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.9999708Z x = x_sign * x_clamp 2025-05-07T20:32:46.9999950Z x0 = x[:, :D] 2025-05-07T20:32:47.0000255Z x1 = x[:, D:] 2025-05-07T20:32:47.0000473Z 2025-05-07T20:32:47.0000659Z if contiguous: 2025-05-07T20:32:47.0000896Z x0 = x0.contiguous() 2025-05-07T20:32:47.0001162Z x1 = x1.contiguous() 2025-05-07T20:32:47.0001403Z 2025-05-07T20:32:47.0001602Z if scale_ub is not None: 2025-05-07T20:32:47.0001926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.0002266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.0002585Z ) 2025-05-07T20:32:47.0002787Z else: 2025-05-07T20:32:47.0003003Z scale_ub_tensor = None 2025-05-07T20:32:47.0003262Z 2025-05-07T20:32:47.0003501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.0003818Z op = silu_mul_quant 2025-05-07T20:32:47.0004079Z if compiled: 2025-05-07T20:32:47.0004335Z op = torch.compile(op) 2025-05-07T20:32:47.0004635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.0004915Z 2025-05-07T20:32:47.0005111Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.0005405Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.0005697Z 2025-05-07T20:32:47.0005941Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.0006607Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.0006932Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.0007282Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.0007689Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.0008037Z 2025-05-07T20:32:47.0008253Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.0008474Z 2025-05-07T20:32:47.0008590Z moe/activation_test.py:126: 2025-05-07T20:32:47.0008920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.0009296Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.0009667Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.0010626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.0011537Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.0012197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.0012904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.0013697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.0014446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.0015201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.0015869Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.0016495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.0017025Z fn() 2025-05-07T20:32:47.0017546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.0018151Z self.fn.run( 2025-05-07T20:32:47.0018626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.0019178Z kernel = self.compile( 2025-05-07T20:32:47.0019740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.0020481Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.0020893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.0021136Z 2025-05-07T20:32:47.0021421Z self = 2025-05-07T20:32:47.0022546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.0024030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bc6f20>} 2025-05-07T20:32:47.0025425Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.0026491Z context = 2025-05-07T20:32:47.0026794Z 2025-05-07T20:32:47.0026964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.0027507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.0027985Z module_map=module_map) 2025-05-07T20:32:47.0028361Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.0028727Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.0029000Z E ^ 2025-05-07T20:32:47.0029483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.0029957Z 2025-05-07T20:32:47.0030389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.0030921Z 2025-05-07T20:32:47.0031034Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.0031454Z self=, 2025-05-07T20:32:47.0031870Z T=4096, 2025-05-07T20:32:47.0032061Z D=5120, 2025-05-07T20:32:47.0032252Z scale_ub=None, 2025-05-07T20:32:47.0032474Z contiguous=False, 2025-05-07T20:32:47.0032709Z compiled=False, 2025-05-07T20:32:47.0032908Z ) 2025-05-07T20:32:47.7960026Z self = 2025-05-07T20:32:47.7960776Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.7961191Z 2025-05-07T20:32:47.7961318Z @given( 2025-05-07T20:32:47.7961584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7961902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7962497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7962831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7963192Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7963479Z ) 2025-05-07T20:32:47.7963838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7964299Z def test_silu_mul_quant( 2025-05-07T20:32:47.7964542Z self, 2025-05-07T20:32:47.7964742Z T: int, 2025-05-07T20:32:47.7964947Z D: int, 2025-05-07T20:32:47.7965168Z scale_ub: Optional[float], 2025-05-07T20:32:47.7965446Z contiguous: bool, 2025-05-07T20:32:47.7965692Z compiled: bool, 2025-05-07T20:32:47.7965924Z ) -> None: 2025-05-07T20:32:47.7966145Z torch.manual_seed(2025) 2025-05-07T20:32:47.7966398Z 2025-05-07T20:32:47.7966671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7967034Z 2025-05-07T20:32:47.7967271Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7967569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7967981Z x = x_sign * x_clamp 2025-05-07T20:32:47.7968230Z x0 = x[:, :D] 2025-05-07T20:32:47.7968450Z x1 = x[:, D:] 2025-05-07T20:32:47.7968654Z 2025-05-07T20:32:47.7968846Z if contiguous: 2025-05-07T20:32:47.7969168Z x0 = x0.contiguous() 2025-05-07T20:32:47.7969426Z x1 = x1.contiguous() 2025-05-07T20:32:47.7969668Z 2025-05-07T20:32:47.7969862Z if scale_ub is not None: 2025-05-07T20:32:47.7970135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7970477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7970877Z ) 2025-05-07T20:32:47.7971069Z else: 2025-05-07T20:32:47.7971284Z scale_ub_tensor = None 2025-05-07T20:32:47.7971540Z 2025-05-07T20:32:47.7971776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7972200Z op = silu_mul_quant 2025-05-07T20:32:47.7972456Z if compiled: 2025-05-07T20:32:47.7972709Z op = torch.compile(op) 2025-05-07T20:32:47.7973010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7973293Z 2025-05-07T20:32:47.7973492Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.7973659Z 2025-05-07T20:32:47.7973764Z moe/activation_test.py:117: 2025-05-07T20:32:47.7974098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7974440Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.7974728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7975634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.7976356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.7976914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7977619Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7978305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7978857Z kernel = self.compile( 2025-05-07T20:32:47.7979416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7980089Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7980501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7980744Z 2025-05-07T20:32:47.7980959Z self = 2025-05-07T20:32:47.7982142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7983588Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bc7ec0>} 2025-05-07T20:32:47.7984986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7986052Z context = 2025-05-07T20:32:47.7986348Z 2025-05-07T20:32:47.7986525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.7987120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.7987594Z module_map=module_map) 2025-05-07T20:32:47.7987972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.7988335Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.7988596Z E ^ 2025-05-07T20:32:47.7989119Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.7989584Z 2025-05-07T20:32:47.7990022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.7990594Z 2025-05-07T20:32:47.7990708Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.7991127Z self=, 2025-05-07T20:32:47.7991544Z T=4096, 2025-05-07T20:32:47.7991736Z D=7168, 2025-05-07T20:32:47.7991971Z scale_ub=None, 2025-05-07T20:32:47.7992192Z contiguous=False, 2025-05-07T20:32:47.7992424Z compiled=False, 2025-05-07T20:32:47.7992628Z ) 2025-05-07T20:32:47.7992964Z self = 2025-05-07T20:32:47.7993476Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.7993757Z 2025-05-07T20:32:47.7993842Z @given( 2025-05-07T20:32:47.7994074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7994395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7994709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7995041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7995377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7995671Z ) 2025-05-07T20:32:47.7996023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7996475Z def test_silu_mul_quant( 2025-05-07T20:32:47.7996728Z self, 2025-05-07T20:32:47.7996920Z T: int, 2025-05-07T20:32:47.7997121Z D: int, 2025-05-07T20:32:47.7997341Z scale_ub: Optional[float], 2025-05-07T20:32:47.7997615Z contiguous: bool, 2025-05-07T20:32:47.7997860Z compiled: bool, 2025-05-07T20:32:47.7998090Z ) -> None: 2025-05-07T20:32:47.7998310Z torch.manual_seed(2025) 2025-05-07T20:32:47.7998555Z 2025-05-07T20:32:47.7998832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7999185Z 2025-05-07T20:32:47.7999378Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7999678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7999996Z x = x_sign * x_clamp 2025-05-07T20:32:47.8000238Z x0 = x[:, :D] 2025-05-07T20:32:47.8000461Z x1 = x[:, D:] 2025-05-07T20:32:47.8000672Z 2025-05-07T20:32:47.8000856Z if contiguous: 2025-05-07T20:32:47.8001095Z x0 = x0.contiguous() 2025-05-07T20:32:47.8001362Z x1 = x1.contiguous() 2025-05-07T20:32:47.8001601Z 2025-05-07T20:32:47.8001799Z if scale_ub is not None: 2025-05-07T20:32:47.8002164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.8002501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.8002815Z ) 2025-05-07T20:32:47.8003013Z else: 2025-05-07T20:32:47.8003231Z scale_ub_tensor = None 2025-05-07T20:32:47.8003481Z 2025-05-07T20:32:47.8003720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.8004046Z op = silu_mul_quant 2025-05-07T20:32:47.8004295Z if compiled: 2025-05-07T20:32:47.8004545Z op = torch.compile(op) 2025-05-07T20:32:47.8004846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.8005119Z 2025-05-07T20:32:47.8005313Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.8005478Z 2025-05-07T20:32:47.8005588Z moe/activation_test.py:117: 2025-05-07T20:32:47.8005885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8006501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.8006792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.8007608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.8008320Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.8008874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.8009645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.8010326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.8010878Z kernel = self.compile( 2025-05-07T20:32:47.8011438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.8012244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.8012653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8012895Z 2025-05-07T20:32:47.8013109Z self = 2025-05-07T20:32:47.8014235Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.8015665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bad620>} 2025-05-07T20:32:47.8017114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.8018189Z context = 2025-05-07T20:32:47.8018493Z 2025-05-07T20:32:47.8018666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.8019206Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.8019684Z module_map=module_map) 2025-05-07T20:32:47.8020058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.8020423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.8020689Z E ^ 2025-05-07T20:32:47.8021160Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.8021632Z 2025-05-07T20:32:47.8022061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.8022594Z 2025-05-07T20:32:47.8022706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.8023127Z self=, 2025-05-07T20:32:47.8023616Z T=128, 2025-05-07T20:32:47.8023813Z D=7168, 2025-05-07T20:32:47.8024014Z scale_ub=None, 2025-05-07T20:32:47.8024233Z contiguous=False, 2025-05-07T20:32:47.8024471Z compiled=True, 2025-05-07T20:32:47.8024676Z ) 2025-05-07T20:32:47.8583899Z self = 2025-05-07T20:32:47.8584638Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.8585033Z 2025-05-07T20:32:47.8585172Z @given( 2025-05-07T20:32:47.8585488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.8585840Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.8586158Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.8586552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.8586888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.8587171Z ) 2025-05-07T20:32:47.8587530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.8587983Z def test_silu_mul_quant( 2025-05-07T20:32:47.8588223Z self, 2025-05-07T20:32:47.8588651Z T: int, 2025-05-07T20:32:47.8588856Z D: int, 2025-05-07T20:32:47.8589071Z scale_ub: Optional[float], 2025-05-07T20:32:47.8589346Z contiguous: bool, 2025-05-07T20:32:47.8589595Z compiled: bool, 2025-05-07T20:32:47.8589889Z ) -> None: 2025-05-07T20:32:47.8590107Z torch.manual_seed(2025) 2025-05-07T20:32:47.8590354Z 2025-05-07T20:32:47.8590629Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.8590977Z 2025-05-07T20:32:47.8591177Z x_sign = torch.sign(x) 2025-05-07T20:32:47.8591545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.8591867Z x = x_sign * x_clamp 2025-05-07T20:32:47.8592110Z x0 = x[:, :D] 2025-05-07T20:32:47.8592329Z x1 = x[:, D:] 2025-05-07T20:32:47.8592534Z 2025-05-07T20:32:47.8592729Z if contiguous: 2025-05-07T20:32:47.8592965Z x0 = x0.contiguous() 2025-05-07T20:32:47.8593222Z x1 = x1.contiguous() 2025-05-07T20:32:47.8593467Z 2025-05-07T20:32:47.8593659Z if scale_ub is not None: 2025-05-07T20:32:47.8593928Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.8594283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.8601220Z ) 2025-05-07T20:32:47.8601436Z else: 2025-05-07T20:32:47.8601653Z scale_ub_tensor = None 2025-05-07T20:32:47.8601918Z 2025-05-07T20:32:47.8602158Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.8602487Z op = silu_mul_quant 2025-05-07T20:32:47.8602750Z if compiled: 2025-05-07T20:32:47.8603000Z op = torch.compile(op) 2025-05-07T20:32:47.8603307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.8603591Z 2025-05-07T20:32:47.8603793Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.8604081Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.8604382Z 2025-05-07T20:32:47.8604628Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.8604967Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.8605272Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.8605597Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.8605960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.8606556Z 2025-05-07T20:32:47.8606765Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.8606967Z 2025-05-07T20:32:47.8607076Z moe/activation_test.py:126: 2025-05-07T20:32:47.8607379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8607724Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.8608061Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.8609001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.8609789Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.8610350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.8611054Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.8611758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.8612580Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.8613341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.8613992Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.8614617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.8615149Z fn() 2025-05-07T20:32:47.8615746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.8616378Z self.fn.run( 2025-05-07T20:32:47.8616875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.8617489Z kernel = self.compile( 2025-05-07T20:32:47.8618039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.8618713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.8619189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8619426Z 2025-05-07T20:32:47.8619646Z self = 2025-05-07T20:32:47.8620770Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.8622206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bacd60>} 2025-05-07T20:32:47.8623602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.8624666Z context = 2025-05-07T20:32:47.8624965Z 2025-05-07T20:32:47.8625141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.8625676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.8626174Z module_map=module_map) 2025-05-07T20:32:47.8626587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.8626947Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.8627220Z E ^ 2025-05-07T20:32:47.8627697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.8628163Z 2025-05-07T20:32:47.8628600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.8629131Z 2025-05-07T20:32:47.8629237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.8629670Z self=, 2025-05-07T20:32:47.8630086Z T=128, 2025-05-07T20:32:47.8630273Z D=7168, 2025-05-07T20:32:47.8630472Z scale_ub=None, 2025-05-07T20:32:47.8630745Z contiguous=False, 2025-05-07T20:32:47.8630973Z compiled=False, 2025-05-07T20:32:47.8631192Z ) 2025-05-07T20:32:48.0597567Z self = 2025-05-07T20:32:48.0598325Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:48.0598711Z 2025-05-07T20:32:48.0598827Z @given( 2025-05-07T20:32:48.0599127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.0599538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.0599897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.0600231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.0600558Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.0600855Z ) 2025-05-07T20:32:48.0601208Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.0601659Z def test_silu_mul_quant( 2025-05-07T20:32:48.0601911Z self, 2025-05-07T20:32:48.0602118Z T: int, 2025-05-07T20:32:48.0602316Z D: int, 2025-05-07T20:32:48.0602541Z scale_ub: Optional[float], 2025-05-07T20:32:48.0603088Z contiguous: bool, 2025-05-07T20:32:48.0603335Z compiled: bool, 2025-05-07T20:32:48.0603573Z ) -> None: 2025-05-07T20:32:48.0603794Z torch.manual_seed(2025) 2025-05-07T20:32:48.0604044Z 2025-05-07T20:32:48.0604414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.0604768Z 2025-05-07T20:32:48.0604968Z x_sign = torch.sign(x) 2025-05-07T20:32:48.0605262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.0605586Z x = x_sign * x_clamp 2025-05-07T20:32:48.0605934Z x0 = x[:, :D] 2025-05-07T20:32:48.0606459Z x1 = x[:, D:] 2025-05-07T20:32:48.0606718Z 2025-05-07T20:32:48.0606912Z if contiguous: 2025-05-07T20:32:48.0607147Z x0 = x0.contiguous() 2025-05-07T20:32:48.0607424Z x1 = x1.contiguous() 2025-05-07T20:32:48.0607673Z 2025-05-07T20:32:48.0607865Z if scale_ub is not None: 2025-05-07T20:32:48.0608153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.0608501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.0608815Z ) 2025-05-07T20:32:48.0609018Z else: 2025-05-07T20:32:48.0609241Z scale_ub_tensor = None 2025-05-07T20:32:48.0609498Z 2025-05-07T20:32:48.0609738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0610064Z op = silu_mul_quant 2025-05-07T20:32:48.0610323Z if compiled: 2025-05-07T20:32:48.0610571Z op = torch.compile(op) 2025-05-07T20:32:48.0610879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0611165Z 2025-05-07T20:32:48.0611357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.0611536Z 2025-05-07T20:32:48.0611639Z moe/activation_test.py:117: 2025-05-07T20:32:48.0612037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0612375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.0612666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0613384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.0614104Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.0614651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.0615357Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.0616051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.0616600Z kernel = self.compile( 2025-05-07T20:32:48.0617261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.0617948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.0618365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0618602Z 2025-05-07T20:32:48.0618815Z self = 2025-05-07T20:32:48.0619937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.0621386Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd92340>} 2025-05-07T20:32:48.0622799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.0623890Z context = 2025-05-07T20:32:48.0624192Z 2025-05-07T20:32:48.0624432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.0624985Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.0625538Z module_map=module_map) 2025-05-07T20:32:48.0625912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.0626281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.0626553Z E ^ 2025-05-07T20:32:48.0627047Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.0627631Z 2025-05-07T20:32:48.0628075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.0628632Z 2025-05-07T20:32:48.0628739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.0629176Z self=, 2025-05-07T20:32:48.0629597Z T=4096, 2025-05-07T20:32:48.0629795Z D=5120, 2025-05-07T20:32:48.0629998Z scale_ub=1200.0, 2025-05-07T20:32:48.0630228Z contiguous=True, 2025-05-07T20:32:48.0630454Z compiled=False, 2025-05-07T20:32:48.0630677Z ) 2025-05-07T20:32:48.0631020Z self = 2025-05-07T20:32:48.0631541Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:48.0631836Z 2025-05-07T20:32:48.0631917Z @given( 2025-05-07T20:32:48.0632155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.0632479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.0632799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.0633147Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.0633482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.0633781Z ) 2025-05-07T20:32:48.0634152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.0634619Z def test_silu_mul_quant( 2025-05-07T20:32:48.0634867Z self, 2025-05-07T20:32:48.0635070Z T: int, 2025-05-07T20:32:48.0635281Z D: int, 2025-05-07T20:32:48.0635502Z scale_ub: Optional[float], 2025-05-07T20:32:48.0635787Z contiguous: bool, 2025-05-07T20:32:48.0636038Z compiled: bool, 2025-05-07T20:32:48.0636262Z ) -> None: 2025-05-07T20:32:48.0636494Z torch.manual_seed(2025) 2025-05-07T20:32:48.0636777Z 2025-05-07T20:32:48.0637054Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.0637411Z 2025-05-07T20:32:48.0637609Z x_sign = torch.sign(x) 2025-05-07T20:32:48.0637903Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.0638279Z x = x_sign * x_clamp 2025-05-07T20:32:48.0638527Z x0 = x[:, :D] 2025-05-07T20:32:48.0638745Z x1 = x[:, D:] 2025-05-07T20:32:48.0638962Z 2025-05-07T20:32:48.0639155Z if contiguous: 2025-05-07T20:32:48.0639395Z x0 = x0.contiguous() 2025-05-07T20:32:48.0639656Z x1 = x1.contiguous() 2025-05-07T20:32:48.0639908Z 2025-05-07T20:32:48.0640106Z if scale_ub is not None: 2025-05-07T20:32:48.0640382Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.0640729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.0641050Z ) 2025-05-07T20:32:48.0641243Z else: 2025-05-07T20:32:48.0641457Z scale_ub_tensor = None 2025-05-07T20:32:48.0641738Z 2025-05-07T20:32:48.0641980Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0642302Z op = silu_mul_quant 2025-05-07T20:32:48.0642567Z if compiled: 2025-05-07T20:32:48.0642826Z op = torch.compile(op) 2025-05-07T20:32:48.0643132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0643411Z 2025-05-07T20:32:48.0643666Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.0643838Z 2025-05-07T20:32:48.0643950Z moe/activation_test.py:117: 2025-05-07T20:32:48.0644253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0644639Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.0644931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0645639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.0646361Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.0647000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.0647728Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.0648415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.0648974Z kernel = self.compile( 2025-05-07T20:32:48.0649537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.0650207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.0650625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0650873Z 2025-05-07T20:32:48.0651087Z self = 2025-05-07T20:32:48.0652261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.0653693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd91c60>} 2025-05-07T20:32:48.0655086Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.0656152Z context = 2025-05-07T20:32:48.0656457Z 2025-05-07T20:32:48.0656639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.0657219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.0657698Z module_map=module_map) 2025-05-07T20:32:48.0658075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.0658439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.0658703Z E ^ 2025-05-07T20:32:48.0659237Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.0659712Z 2025-05-07T20:32:48.0660151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.0660682Z 2025-05-07T20:32:48.0660794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.0661218Z self=, 2025-05-07T20:32:48.0661636Z T=1, 2025-05-07T20:32:48.0661823Z D=5120, 2025-05-07T20:32:48.0662016Z scale_ub=None, 2025-05-07T20:32:48.0662244Z contiguous=True, 2025-05-07T20:32:48.0662473Z compiled=True, 2025-05-07T20:32:48.0662676Z ) 2025-05-07T20:32:48.4399016Z self = 2025-05-07T20:32:48.4399751Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.4400076Z 2025-05-07T20:32:48.4400184Z @given( 2025-05-07T20:32:48.4400423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.4401010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.4401338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.4401673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.4401997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.4402380Z ) 2025-05-07T20:32:48.4402743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.4403201Z def test_silu_mul_quant( 2025-05-07T20:32:48.4403443Z self, 2025-05-07T20:32:48.4403649Z T: int, 2025-05-07T20:32:48.4403851Z D: int, 2025-05-07T20:32:48.4404157Z scale_ub: Optional[float], 2025-05-07T20:32:48.4404432Z contiguous: bool, 2025-05-07T20:32:48.4404678Z compiled: bool, 2025-05-07T20:32:48.4404908Z ) -> None: 2025-05-07T20:32:48.4405136Z torch.manual_seed(2025) 2025-05-07T20:32:48.4405384Z 2025-05-07T20:32:48.4405653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.4406004Z 2025-05-07T20:32:48.4406559Z x_sign = torch.sign(x) 2025-05-07T20:32:48.4406858Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.4407174Z x = x_sign * x_clamp 2025-05-07T20:32:48.4407461Z x0 = x[:, :D] 2025-05-07T20:32:48.4407681Z x1 = x[:, D:] 2025-05-07T20:32:48.4407896Z 2025-05-07T20:32:48.4408075Z if contiguous: 2025-05-07T20:32:48.4408296Z x0 = x0.contiguous() 2025-05-07T20:32:48.4408556Z x1 = x1.contiguous() 2025-05-07T20:32:48.4408791Z 2025-05-07T20:32:48.4408975Z if scale_ub is not None: 2025-05-07T20:32:48.4409247Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.4409583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.4409897Z ) 2025-05-07T20:32:48.4410088Z else: 2025-05-07T20:32:48.4410301Z scale_ub_tensor = None 2025-05-07T20:32:48.4410554Z 2025-05-07T20:32:48.4410789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.4411108Z op = silu_mul_quant 2025-05-07T20:32:48.4411362Z if compiled: 2025-05-07T20:32:48.4411605Z op = torch.compile(op) 2025-05-07T20:32:48.4412021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.4412303Z 2025-05-07T20:32:48.4412489Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.4412778Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.4413074Z 2025-05-07T20:32:48.4413307Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.4413649Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.4413945Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.4414269Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.4414743Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.4415068Z 2025-05-07T20:32:48.4415277Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.4415479Z 2025-05-07T20:32:48.4415584Z moe/activation_test.py:126: 2025-05-07T20:32:48.4415890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4416233Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.4416568Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.4417384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.4418163Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.4418731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.4419433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.4420144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.4420958Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.4421716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.4422432Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.4423056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.4423593Z fn() 2025-05-07T20:32:48.4424110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.4424783Z self.fn.run( 2025-05-07T20:32:48.4425262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.4425809Z kernel = self.compile( 2025-05-07T20:32:48.4426361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.4427075Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.4427494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4427733Z 2025-05-07T20:32:48.4427950Z self = 2025-05-07T20:32:48.4429064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.4430514Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd932e0>} 2025-05-07T20:32:48.4431917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.4432985Z context = 2025-05-07T20:32:48.4433281Z 2025-05-07T20:32:48.4433456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.4433993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.4434474Z module_map=module_map) 2025-05-07T20:32:48.4434848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.4435210Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.4435487Z E ^ 2025-05-07T20:32:48.4435964Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Fails identically to the T=1 example above: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row raises triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
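All of these failures share one root cause: Triton lowers the fp8e4nv (FP8 E4M3) dtype only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the A10G on a linux.g5.4xlarge runner reports SM 8.6, so both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row die in make_ir before any kernel launch. A guard along the following lines (hypothetical, not part of activation_test.py) would skip the FP8 cases on such runners:

```python
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # get_device_capability() returns (major, minor), e.g. (8, 6) for the
    # A10G; Triton's fp8e4nv (E4M3) lowering needs (8, 9) or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class Fp8ActivationTests(unittest.TestCase):
    """FP8 MoE activation tests would be collected here instead of failing."""
```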
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Same CompilationError from _kernel_quantize_fp8_row via ref_fn().

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
Same CompilationError from _kernel_quantize_fp8_row via ref_fn().

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
The T=16384 example then fails with the same CompilationError from _kernel_quantize_fp8_row via ref_fn().

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Fails in fn() instead: with compiled=True the call enters through torch/_dynamo/eval_frame.py:678 (_fn) and reaches fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant), where launching _fbgemm_silu_mul_quant[grid] raises the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
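Separately from the compile error, the recompile_limit warning above has a simple cause: each Hypothesis example changes T and, via the contiguous flag, the strides of x0, so torch.compile specializes silu_mul_quant again for every new shape/stride combination until the cap of 8 is hit and dynamo stops recompiling. Two illustrative mitigations (a sketch, not what the test does; the config knob is the one named in the warning itself, and the import path is taken from the tracebacks):

```python
import torch
import torch._dynamo

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Option 1: compile with dynamic shapes so T is treated symbolically and a
# new batch size or stride pattern does not force a fresh specialization.
op = torch.compile(silu_mul_quant, dynamic=True)

# Option 2: raise the cap named in the warning (default 8).
torch._dynamo.config.recompile_limit = 32
```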
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
Same CompilationError from _kernel_quantize_fp8_row via ref_fn().
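The ref_fn tracebacks all pass through triton/runtime/autotuner.py (_bench / do_bench) because _kernel_quantize_fp8_row is an autotuned Triton kernel: its first launch benchmarks every candidate config, and compilation, which is where this ValueError surfaces, happens inside that benchmarking loop. A toy illustration of the decorator pattern (not the fbgemm kernel):

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=2),
        triton.Config({"BLOCK": 256}, num_warps=4),
    ],
    key=["N"],  # re-benchmark the configs whenever N changes
)
@triton.jit
def _double(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    # Each config above is compiled and timed on first launch; a dtype the
    # target GPU cannot lower fails right here, before any timing happens.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)
```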
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
Fails in fn(): the eager path reaches fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) directly, and the _fbgemm_silu_mul_quant[grid] launch raises the same CompilationError.
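The repeated "Trying example:" lines are Hypothesis verbose output: the @given strategies shown in every listing enumerate a small parameter grid, and @settings(verbosity=Verbosity.verbose, ...) prints each drawn combination. A self-contained sketch of the same harness pattern (toy property; _MAX_SAMPLES is the suite's own constant and its value is not visible in this log):

```python
from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st


@given(
    t=st.sampled_from([1, 128, 2048]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
def test_toy(t: int, compiled: bool) -> None:
    # Verbosity.verbose prints a "Trying example: test_toy(...)" line for
    # each drawn combination, exactly like the log above.
    assert t >= 1
```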
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2992541Z 2025-05-07T20:32:50.2993013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2993543Z 2025-05-07T20:32:50.2993653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2994072Z self=, 2025-05-07T20:32:50.2994530Z T=128, 2025-05-07T20:32:50.2994721Z D=5120, 2025-05-07T20:32:50.2994909Z scale_ub=None, 2025-05-07T20:32:50.2995126Z contiguous=False, 2025-05-07T20:32:50.2995353Z compiled=True, 2025-05-07T20:32:50.2995556Z ) 2025-05-07T20:32:50.2995884Z self = 2025-05-07T20:32:50.2996436Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.2996710Z 2025-05-07T20:32:50.2996793Z @given( 2025-05-07T20:32:50.2997022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2997344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2997683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2998036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2998372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2998664Z ) 2025-05-07T20:32:50.2999017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2999470Z def test_silu_mul_quant( 2025-05-07T20:32:50.2999718Z self, 2025-05-07T20:32:50.2999906Z T: int, 2025-05-07T20:32:50.3000107Z D: int, 2025-05-07T20:32:50.3000328Z scale_ub: Optional[float], 2025-05-07T20:32:50.3000605Z contiguous: bool, 2025-05-07T20:32:50.3000840Z compiled: bool, 2025-05-07T20:32:50.3001063Z ) -> None: 2025-05-07T20:32:50.3001278Z torch.manual_seed(2025) 2025-05-07T20:32:50.3001516Z 2025-05-07T20:32:50.3001795Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.3002143Z 2025-05-07T20:32:50.3002334Z x_sign = torch.sign(x) 2025-05-07T20:32:50.3002632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.3002945Z x = x_sign * x_clamp 2025-05-07T20:32:50.3003180Z x0 = x[:, :D] 2025-05-07T20:32:50.3003397Z x1 = x[:, D:] 2025-05-07T20:32:50.3003613Z 2025-05-07T20:32:50.3003796Z if contiguous: 2025-05-07T20:32:50.3004030Z x0 = x0.contiguous() 2025-05-07T20:32:50.3004294Z x1 = x1.contiguous() 2025-05-07T20:32:50.3004533Z 2025-05-07T20:32:50.3004726Z if scale_ub is not None: 2025-05-07T20:32:50.3005002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.3005339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.3005655Z ) 2025-05-07T20:32:50.3005848Z else: 2025-05-07T20:32:50.3006111Z scale_ub_tensor = None 2025-05-07T20:32:50.3006719Z 2025-05-07T20:32:50.3006956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.3007281Z op = silu_mul_quant 2025-05-07T20:32:50.3007530Z if compiled: 2025-05-07T20:32:50.3007812Z op = torch.compile(op) 2025-05-07T20:32:50.3008135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3008411Z 2025-05-07T20:32:50.3008604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.3008769Z 2025-05-07T20:32:50.3008873Z moe/activation_test.py:117: 2025-05-07T20:32:50.3009166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3009503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.3009792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3010362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.3010928Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.3011607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:50.3012497Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:50.3013045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.3013743Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.3014484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.3015033Z     kernel = self.compile(
2025-05-07T20:32:50.3015582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.3016320Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.3016735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:50.3016970Z 
2025-05-07T20:32:50.3017190Z self = 
2025-05-07T20:32:50.3018373Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:50.3019820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c8451c0>}
2025-05-07T20:32:50.3021230Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.3022292Z context = 
2025-05-07T20:32:50.3022590Z 
2025-05-07T20:32:50.3022766Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.3023302Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.3023783Z             module_map=module_map)
2025-05-07T20:32:50.3024157Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.3024510Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:50.3024774Z E   ^
2025-05-07T20:32:50.3025251Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.3025713Z 
2025-05-07T20:32:50.3026141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
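Every failing example above dies in the same place for the same reason: Triton cannot lower the fp8e4nv (float8 e4m3) element type for this GPU, and only fp8e4b15 and fp8e5 are available. Triton emits this particular ValueError on CUDA devices below compute capability 8.9, so the practical fix is to gate e4m3 kernels and their tests on device capability. A minimal sketch of such a guard follows, assuming an 8.9 threshold; supports_fp8e4nv is a hypothetical helper name, not an FBGEMM or Triton API:

# Sketch: skip fp8e4nv (e4m3) tests on GPUs below compute capability 8.9,
# the threshold implied by Triton's "type fp8e4nv not supported" error.
# supports_fp8e4nv is a made-up name for illustration only.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns (major, minor), e.g. (8, 6) or (9, 0).
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv (e4m3) requires SM 8.9+")
class SiluMulQuantFp8Test(unittest.TestCase):
    ...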
2025-05-07T20:32:50.3026687Z 
2025-05-07T20:32:50.3026790Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant (test source and traceback identical to the example above)
2025-05-07T20:32:50.4208209Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:50.4240260Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:50.5999784Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:50.6033199Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
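The Hypothesis sweep over T, D, scale_ub, contiguous, and compiled never changes the outcome, because the failure happens at kernel compile time, before any shape-dependent work runs. Any Triton kernel that casts to tl.float8e4nv reproduces it. A standalone repro sketch follows; the kernel and names below are illustrative, not FBGEMM code:

# Sketch: the cast to tl.float8e4nv alone triggers the ValueError at
# compile time on a pre-SM-8.9 device; on newer GPUs this runs fine.
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This .to() is the operation the architecture check rejects.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)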
2025-05-07T20:32:50.7396537Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:50.7397013Z     self=,
2025-05-07T20:32:50.7397433Z     T=1,
2025-05-07T20:32:50.7397629Z     D=7168,
2025-05-07T20:32:50.7397830Z     scale_ub=None,
2025-05-07T20:32:50.7398059Z     contiguous=False,
2025-05-07T20:32:50.7398303Z     compiled=True,
2025-05-07T20:32:50.7405512Z )
2025-05-07T20:32:50.8268548Z self = 
2025-05-07T20:32:50.8269312Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:50.8269618Z 
2025-05-07T20:32:50.8269703Z     @given(
2025-05-07T20:32:50.8269945Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:50.8270265Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:50.8270572Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:50.8270921Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:50.8271255Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:50.8271535Z     )
2025-05-07T20:32:50.8271898Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:50.8272358Z     def test_silu_mul_quant(
2025-05-07T20:32:50.8272804Z         self,
2025-05-07T20:32:50.8273005Z         T: int,
2025-05-07T20:32:50.8273207Z         D: int,
2025-05-07T20:32:50.8273434Z         scale_ub: Optional[float],
2025-05-07T20:32:50.8273706Z         contiguous: bool,
2025-05-07T20:32:50.8274040Z         compiled: bool,
2025-05-07T20:32:50.8274277Z     ) -> None:
2025-05-07T20:32:50.8274493Z         torch.manual_seed(2025)
2025-05-07T20:32:50.8274747Z 
2025-05-07T20:32:50.8275033Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:50.8275384Z 
2025-05-07T20:32:50.8275589Z         x_sign = torch.sign(x)
2025-05-07T20:32:50.8275968Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:50.8276281Z         x = x_sign * x_clamp
2025-05-07T20:32:50.8276526Z         x0 = x[:, :D]
2025-05-07T20:32:50.8276752Z         x1 = x[:, D:]
2025-05-07T20:32:50.8276957Z 
2025-05-07T20:32:50.8277148Z         if contiguous:
2025-05-07T20:32:50.8277385Z             x0 = x0.contiguous()
2025-05-07T20:32:50.8277650Z             x1 = x1.contiguous()
2025-05-07T20:32:50.8277890Z 
2025-05-07T20:32:50.8278084Z         if scale_ub is not None:
2025-05-07T20:32:50.8278361Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:50.8278703Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:50.8279025Z             )
2025-05-07T20:32:50.8279225Z         else:
2025-05-07T20:32:50.8279436Z             scale_ub_tensor = None
2025-05-07T20:32:50.8279695Z 
2025-05-07T20:32:50.8279932Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:50.8280255Z             op = silu_mul_quant
2025-05-07T20:32:50.8280512Z             if compiled:
2025-05-07T20:32:50.8280767Z                 op = torch.compile(op)
2025-05-07T20:32:50.8281065Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:50.8281355Z 
2025-05-07T20:32:50.8281553Z         y_fp8, y_scale = fn()
2025-05-07T20:32:50.8281842Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:50.8282151Z 
2025-05-07T20:32:50.8282392Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:50.8282737Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:50.8283034Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:50.8283359Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:50.8283731Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.8284042Z 
2025-05-07T20:32:50.8284250Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:50.8284450Z 
2025-05-07T20:32:50.8284563Z moe/activation_test.py:126: 
2025-05-07T20:32:50.8284866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:50.8285215Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:50.8285654Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.8286484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:50.8287260Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:50.8287830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.8288542Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.8289254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:50.8290004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:50.8290765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:50.8291434Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:50.8292141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:50.8292740Z     fn()
2025-05-07T20:32:50.8293280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:50.8293886Z     self.fn.run(
2025-05-07T20:32:50.8294412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.8294969Z     kernel = self.compile(
2025-05-07T20:32:50.8295530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.8296207Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.8296675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:50.8296920Z 
2025-05-07T20:32:50.8297142Z self = 
2025-05-07T20:32:50.8298277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:50.8299725Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92cb80>}
2025-05-07T20:32:50.8301268Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.8302351Z context = 
2025-05-07T20:32:50.8302652Z 
2025-05-07T20:32:50.8302830Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.8303379Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.8303862Z             module_map=module_map)
2025-05-07T20:32:50.8304243Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.8304619Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:50.8304890Z E   ^
2025-05-07T20:32:50.8305373Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.8305846Z 
2025-05-07T20:32:50.8306594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
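This example is the one variant in the section that gets past fn(): the failure moves to the reference path, where triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which performs the same cast to the e4m3 type. As a rough eager-mode stand-in for what that reference computes, assuming per-row max-abs scaling into the e4m3 range (an assumption about the semantics, not FBGEMM's actual implementation):

# Sketch: eager-mode row-wise fp8 quantization in the spirit of
# triton_quantize_fp8_row. The semantics (max-abs per row, optional
# scale_ub clamp) are assumed here for illustration.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # per-row scale
    y_fp8 = (y.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return y_fp8.to(torch.float8_e4m3fn), scale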
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8305846Z 2025-05-07T20:32:50.8306594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8307136Z 2025-05-07T20:32:50.8307249Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8307685Z self=, 2025-05-07T20:32:50.8308109Z T=1, 2025-05-07T20:32:50.8308299Z D=5120, 2025-05-07T20:32:50.8308590Z scale_ub=1200.0, 2025-05-07T20:32:50.8308827Z contiguous=False, 2025-05-07T20:32:50.8309065Z compiled=True, 2025-05-07T20:32:50.8309273Z ) 2025-05-07T20:32:50.9859160Z self = 2025-05-07T20:32:50.9859966Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.9860354Z 2025-05-07T20:32:50.9860463Z @given( 2025-05-07T20:32:50.9860790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9861112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9861432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9861772Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9862101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9862402Z ) 2025-05-07T20:32:50.9862765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9863222Z def test_silu_mul_quant( 2025-05-07T20:32:50.9863482Z self, 2025-05-07T20:32:50.9863686Z T: int, 2025-05-07T20:32:50.9863886Z D: int, 2025-05-07T20:32:50.9864111Z scale_ub: Optional[float], 2025-05-07T20:32:50.9864575Z contiguous: bool, 2025-05-07T20:32:50.9864826Z compiled: bool, 2025-05-07T20:32:50.9865066Z ) -> None: 2025-05-07T20:32:50.9865292Z torch.manual_seed(2025) 2025-05-07T20:32:50.9865651Z 2025-05-07T20:32:50.9865925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9866280Z 2025-05-07T20:32:50.9866479Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9866774Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9867093Z x = x_sign * x_clamp 2025-05-07T20:32:50.9867421Z x0 = x[:, :D] 2025-05-07T20:32:50.9867666Z x1 = x[:, D:] 2025-05-07T20:32:50.9867907Z 2025-05-07T20:32:50.9868098Z if contiguous: 2025-05-07T20:32:50.9868334Z x0 = x0.contiguous() 2025-05-07T20:32:50.9868606Z x1 = x1.contiguous() 2025-05-07T20:32:50.9868854Z 2025-05-07T20:32:50.9869045Z if scale_ub is not None: 2025-05-07T20:32:50.9869326Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9869672Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9869985Z ) 2025-05-07T20:32:50.9870186Z else: 2025-05-07T20:32:50.9870410Z scale_ub_tensor = None 2025-05-07T20:32:50.9870664Z 2025-05-07T20:32:50.9870903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9871228Z op = silu_mul_quant 2025-05-07T20:32:50.9871489Z if compiled: 2025-05-07T20:32:50.9871740Z op = torch.compile(op) 2025-05-07T20:32:50.9872051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9872334Z 2025-05-07T20:32:50.9872528Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9872709Z 2025-05-07T20:32:50.9872814Z moe/activation_test.py:117: 2025-05-07T20:32:50.9873123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9873464Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9873758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9874343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.9874933Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.9875613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9876331Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9876888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9877595Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9878372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9878929Z kernel = self.compile( 2025-05-07T20:32:50.9879498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9880172Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9880589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9880831Z 2025-05-07T20:32:50.9881053Z self = 2025-05-07T20:32:50.9882183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9883630Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92de40>} 2025-05-07T20:32:50.9885080Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9886149Z context = 2025-05-07T20:32:50.9886490Z 2025-05-07T20:32:50.9886669Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9887207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9887690Z module_map=module_map) 2025-05-07T20:32:50.9888072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9888488Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9888754Z E ^ 2025-05-07T20:32:50.9889239Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9889708Z 2025-05-07T20:32:50.9890149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9890683Z 2025-05-07T20:32:50.9890795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9891222Z self=, 2025-05-07T20:32:50.9891644Z T=1, 2025-05-07T20:32:50.9891919Z D=5120, 2025-05-07T20:32:50.9892116Z scale_ub=1200.0, 2025-05-07T20:32:50.9892364Z contiguous=False, 2025-05-07T20:32:50.9892594Z compiled=False, 2025-05-07T20:32:50.9892811Z ) 2025-05-07T20:32:50.9893147Z self = 2025-05-07T20:32:50.9893667Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.9893946Z 2025-05-07T20:32:50.9894028Z @given( 2025-05-07T20:32:50.9894274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9894601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9894916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9895267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9895610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9895899Z ) 2025-05-07T20:32:50.9896264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9896729Z def test_silu_mul_quant( 2025-05-07T20:32:50.9896987Z self, 2025-05-07T20:32:50.9897189Z T: int, 2025-05-07T20:32:50.9897398Z D: int, 2025-05-07T20:32:50.9897626Z scale_ub: Optional[float], 2025-05-07T20:32:50.9897902Z contiguous: bool, 2025-05-07T20:32:50.9898151Z compiled: bool, 2025-05-07T20:32:50.9898381Z ) -> None: 2025-05-07T20:32:50.9898597Z torch.manual_seed(2025) 2025-05-07T20:32:50.9898850Z 2025-05-07T20:32:50.9899185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9899533Z 2025-05-07T20:32:50.9899736Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9900035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9900353Z x = x_sign * x_clamp 2025-05-07T20:32:50.9900603Z x0 = x[:, :D] 2025-05-07T20:32:50.9900830Z x1 = x[:, D:] 2025-05-07T20:32:50.9901043Z 2025-05-07T20:32:50.9901235Z if contiguous: 2025-05-07T20:32:50.9901478Z x0 = x0.contiguous() 2025-05-07T20:32:50.9901739Z x1 = x1.contiguous() 2025-05-07T20:32:50.9901986Z 2025-05-07T20:32:50.9902192Z if scale_ub is not None: 2025-05-07T20:32:50.9902471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9902813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9903130Z ) 2025-05-07T20:32:50.9903330Z else: 2025-05-07T20:32:50.9903541Z scale_ub_tensor = None 2025-05-07T20:32:50.9903809Z 2025-05-07T20:32:50.9904047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9904370Z op = silu_mul_quant 2025-05-07T20:32:50.9904681Z if compiled: 2025-05-07T20:32:50.9904941Z op = torch.compile(op) 2025-05-07T20:32:50.9905244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9905533Z 2025-05-07T20:32:50.9905777Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9905949Z 2025-05-07T20:32:50.9906052Z moe/activation_test.py:117: 2025-05-07T20:32:50.9906627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9906975Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9907270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9908076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9908809Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9909377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9910096Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9910799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9911365Z kernel = self.compile( 2025-05-07T20:32:50.9911937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9912621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9913047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9913290Z 2025-05-07T20:32:50.9913514Z self = 2025-05-07T20:32:50.9914645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9916075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92eac0>} 2025-05-07T20:32:50.9917480Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9918550Z context = 2025-05-07T20:32:50.9918849Z 2025-05-07T20:32:50.9919029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9919569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9920128Z module_map=module_map) 2025-05-07T20:32:50.9920507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9920878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9921141Z E ^ 2025-05-07T20:32:50.9921624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9922096Z 2025-05-07T20:32:50.9922535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9923067Z 2025-05-07T20:32:50.9923179Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9923604Z self=, 2025-05-07T20:32:50.9924026Z T=16384, 2025-05-07T20:32:50.9924228Z D=5120, 2025-05-07T20:32:50.9924422Z scale_ub=1200.0, 2025-05-07T20:32:50.9924659Z contiguous=False, 2025-05-07T20:32:50.9924895Z compiled=True, 2025-05-07T20:32:50.9925102Z ) 2025-05-07T20:32:51.0795840Z self = 2025-05-07T20:32:51.0796889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.0797287Z 2025-05-07T20:32:51.0797396Z @given( 2025-05-07T20:32:51.0797699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.0798072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.0798467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.0798796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.0799129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.0799418Z ) 2025-05-07T20:32:51.0799769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.0800304Z def test_silu_mul_quant( 2025-05-07T20:32:51.0800555Z self, 2025-05-07T20:32:51.0800756Z T: int, 2025-05-07T20:32:51.0800951Z D: int, 2025-05-07T20:32:51.0801181Z scale_ub: Optional[float], 2025-05-07T20:32:51.0801460Z contiguous: bool, 2025-05-07T20:32:51.0801696Z compiled: bool, 2025-05-07T20:32:51.0801930Z ) -> None: 2025-05-07T20:32:51.0802150Z torch.manual_seed(2025) 2025-05-07T20:32:51.0802391Z 2025-05-07T20:32:51.0802668Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.0803027Z 2025-05-07T20:32:51.0803218Z x_sign = torch.sign(x) 2025-05-07T20:32:51.0803517Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.0803835Z x = x_sign * x_clamp 2025-05-07T20:32:51.0804075Z x0 = x[:, :D] 2025-05-07T20:32:51.0804296Z x1 = x[:, D:] 2025-05-07T20:32:51.0804511Z 2025-05-07T20:32:51.0804696Z if contiguous: 2025-05-07T20:32:51.0804932Z x0 = x0.contiguous() 2025-05-07T20:32:51.0805194Z x1 = x1.contiguous() 2025-05-07T20:32:51.0805432Z 2025-05-07T20:32:51.0805628Z if scale_ub is not None: 2025-05-07T20:32:51.0805905Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.0806522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.0806834Z ) 2025-05-07T20:32:51.0807034Z else: 2025-05-07T20:32:51.0807246Z scale_ub_tensor = None 2025-05-07T20:32:51.0807494Z 2025-05-07T20:32:51.0807731Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.0808054Z op = silu_mul_quant 2025-05-07T20:32:51.0808302Z if compiled: 2025-05-07T20:32:51.0808553Z op = torch.compile(op) 2025-05-07T20:32:51.0808857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0809161Z 2025-05-07T20:32:51.0809353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.0809526Z 2025-05-07T20:32:51.0809626Z moe/activation_test.py:117: 2025-05-07T20:32:51.0809930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0810361Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.0810649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0811226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.0811883Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.0812560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.0813283Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.0813840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.0814547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.0815231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.0815789Z kernel = self.compile( 2025-05-07T20:32:51.0816354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.0817132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.0817551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0817795Z 2025-05-07T20:32:51.0818010Z self = 2025-05-07T20:32:51.0819190Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.0820688Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c668180>} 2025-05-07T20:32:51.0822091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.0823161Z context = 2025-05-07T20:32:51.0823458Z 2025-05-07T20:32:51.0823634Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.0824179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.0824655Z module_map=module_map) 2025-05-07T20:32:51.0825030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.0825395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.0825657Z E ^ 2025-05-07T20:32:51.0826138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.0826604Z 2025-05-07T20:32:51.0827042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.0827579Z 2025-05-07T20:32:51.0827710Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.0828160Z self=, 2025-05-07T20:32:51.0828576Z T=2048, 2025-05-07T20:32:51.0828768Z D=7168, 2025-05-07T20:32:51.0828958Z scale_ub=1200.0, 2025-05-07T20:32:51.0829189Z contiguous=False, 2025-05-07T20:32:51.0829419Z compiled=True, 2025-05-07T20:32:51.0829623Z ) 2025-05-07T20:32:51.0829952Z self = 2025-05-07T20:32:51.0830463Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.0830748Z 2025-05-07T20:32:51.0830833Z @given( 2025-05-07T20:32:51.0831062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.0831385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.0831749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.0832081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.0832417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.0832707Z ) 2025-05-07T20:32:51.0833060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.0833514Z def test_silu_mul_quant( 2025-05-07T20:32:51.0833762Z self, 2025-05-07T20:32:51.0833956Z T: int, 2025-05-07T20:32:51.0834156Z D: int, 2025-05-07T20:32:51.0834376Z scale_ub: Optional[float], 2025-05-07T20:32:51.0834649Z contiguous: bool, 2025-05-07T20:32:51.0834893Z compiled: bool, 2025-05-07T20:32:51.0835123Z ) -> None: 2025-05-07T20:32:51.0835345Z torch.manual_seed(2025) 2025-05-07T20:32:51.0835586Z 2025-05-07T20:32:51.0835867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.0836219Z 2025-05-07T20:32:51.0836430Z x_sign = torch.sign(x) 2025-05-07T20:32:51.0844115Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.0844443Z x = x_sign * x_clamp 2025-05-07T20:32:51.0844779Z x0 = x[:, :D] 2025-05-07T20:32:51.0845011Z x1 = x[:, D:] 2025-05-07T20:32:51.0845226Z 2025-05-07T20:32:51.0845411Z if contiguous: 2025-05-07T20:32:51.0845652Z x0 = x0.contiguous() 2025-05-07T20:32:51.0845985Z x1 = x1.contiguous() 2025-05-07T20:32:51.0846324Z 2025-05-07T20:32:51.0846574Z if scale_ub is not None: 2025-05-07T20:32:51.0846855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.0847198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.0847584Z ) 2025-05-07T20:32:51.0847800Z else: 2025-05-07T20:32:51.0848045Z scale_ub_tensor = None 2025-05-07T20:32:51.0848305Z 2025-05-07T20:32:51.0848544Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.0848866Z op = silu_mul_quant 2025-05-07T20:32:51.0849130Z if compiled: 2025-05-07T20:32:51.0849381Z op = torch.compile(op) 2025-05-07T20:32:51.0849677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0849959Z 2025-05-07T20:32:51.0850154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.0850322Z 2025-05-07T20:32:51.0850428Z moe/activation_test.py:117: 2025-05-07T20:32:51.0850731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0851073Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.0851362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0852008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.0852590Z return fn(*args, **kwargs) 
[... identical CompilationError traceback elided: type fp8e4nv not supported in this architecture; supported fp8 dtypes are ('fp8e4b15', 'fp8e5') ...]
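To reproduce one failing example outside Hypothesis, a standalone sketch; the import path mirrors the traceback above and should be treated as an assumption about the installed package layout. On a pre-SM-8.9 GPU this is expected to raise the same CompilationError:

    # Standalone repro sketch for the example above (T=2048, D=7168,
    # scale_ub=1200.0, contiguous=False, i.e. non-contiguous views).
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 2048, 7168
    torch.manual_seed(2025)
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]  # slices of x: non-contiguous, matching contiguous=False
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)  # CompilationError on SM < 8.9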
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.0867269Z 2025-05-07T20:32:51.0867698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.0868236Z 2025-05-07T20:32:51.2017638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2018552Z self=, 2025-05-07T20:32:51.2018987Z T=1, 2025-05-07T20:32:51.2019179Z D=5120, 2025-05-07T20:32:51.2019365Z scale_ub=None, 2025-05-07T20:32:51.2019584Z contiguous=False, 2025-05-07T20:32:51.2019821Z compiled=False, 2025-05-07T20:32:51.2020024Z ) 2025-05-07T20:32:51.2020349Z self = 2025-05-07T20:32:51.2020859Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.2021123Z 2025-05-07T20:32:51.2021205Z @given( 2025-05-07T20:32:51.2021431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2021754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2022066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2022394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2022723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2023011Z ) 2025-05-07T20:32:51.2023362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2023812Z def test_silu_mul_quant( 2025-05-07T20:32:51.2024055Z self, 2025-05-07T20:32:51.2024250Z T: int, 2025-05-07T20:32:51.2024450Z D: int, 2025-05-07T20:32:51.2024671Z scale_ub: Optional[float], 2025-05-07T20:32:51.2024947Z contiguous: bool, 2025-05-07T20:32:51.2025191Z compiled: bool, 2025-05-07T20:32:51.2025422Z ) -> None: 2025-05-07T20:32:51.2025639Z torch.manual_seed(2025) 2025-05-07T20:32:51.2025881Z 2025-05-07T20:32:51.2026159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2026508Z 2025-05-07T20:32:51.2026701Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2026994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2027313Z x = x_sign * x_clamp 2025-05-07T20:32:51.2027548Z x0 = x[:, :D] 2025-05-07T20:32:51.2027771Z x1 = x[:, D:] 2025-05-07T20:32:51.2027980Z 2025-05-07T20:32:51.2028167Z if contiguous: 2025-05-07T20:32:51.2028402Z x0 = x0.contiguous() 2025-05-07T20:32:51.2028664Z x1 = x1.contiguous() 2025-05-07T20:32:51.2028998Z 2025-05-07T20:32:51.2029194Z if scale_ub is not None: 2025-05-07T20:32:51.2029470Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2029805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2030121Z ) 2025-05-07T20:32:51.2030318Z else: 2025-05-07T20:32:51.2030532Z scale_ub_tensor = None 2025-05-07T20:32:51.2030788Z 2025-05-07T20:32:51.2031030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2031354Z op = silu_mul_quant 2025-05-07T20:32:51.2031605Z if compiled: 2025-05-07T20:32:51.2031857Z op = torch.compile(op) 2025-05-07T20:32:51.2032159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2032434Z 2025-05-07T20:32:51.2032633Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.2032800Z 2025-05-07T20:32:51.2032906Z moe/activation_test.py:117: 2025-05-07T20:32:51.2033206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2033546Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.2033835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2034630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.2035339Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.2035960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2036663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2037336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2037922Z kernel = self.compile( 2025-05-07T20:32:51.2038480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2039157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2039561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2039800Z 2025-05-07T20:32:51.2040013Z self = 2025-05-07T20:32:51.2041126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2042561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c669e40>} 2025-05-07T20:32:51.2043950Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2045012Z context = 2025-05-07T20:32:51.2045313Z 2025-05-07T20:32:51.2045482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2046020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2046494Z module_map=module_map) 2025-05-07T20:32:51.2046864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2047224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.2047488Z E ^ 2025-05-07T20:32:51.2047958Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2048427Z 2025-05-07T20:32:51.2048853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.2049382Z 2025-05-07T20:32:51.2049593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2050023Z self=, 2025-05-07T20:32:51.2050431Z T=4096, 2025-05-07T20:32:51.2050625Z D=7168, 2025-05-07T20:32:51.2050824Z scale_ub=1200.0, 2025-05-07T20:32:51.2051047Z contiguous=False, 2025-05-07T20:32:51.2051281Z compiled=False, 2025-05-07T20:32:51.2051494Z ) 2025-05-07T20:32:51.2051890Z self = 2025-05-07T20:32:51.2052405Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.2052689Z 2025-05-07T20:32:51.2052774Z @given( 2025-05-07T20:32:51.2053002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2053327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2053642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2053979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2054314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2054608Z ) 2025-05-07T20:32:51.2055014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2055466Z def test_silu_mul_quant( 2025-05-07T20:32:51.2055714Z self, 2025-05-07T20:32:51.2055913Z T: int, 2025-05-07T20:32:51.2056110Z D: int, 2025-05-07T20:32:51.2056380Z scale_ub: Optional[float], 2025-05-07T20:32:51.2056657Z contiguous: bool, 2025-05-07T20:32:51.2056900Z compiled: bool, 2025-05-07T20:32:51.2057127Z ) -> None: 2025-05-07T20:32:51.2057348Z torch.manual_seed(2025) 2025-05-07T20:32:51.2057589Z 2025-05-07T20:32:51.2057862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2058254Z 2025-05-07T20:32:51.2058444Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2058739Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2059054Z x = x_sign * x_clamp 2025-05-07T20:32:51.2059297Z x0 = x[:, :D] 2025-05-07T20:32:51.2059511Z x1 = x[:, D:] 2025-05-07T20:32:51.2059720Z 2025-05-07T20:32:51.2059912Z if contiguous: 2025-05-07T20:32:51.2060138Z x0 = x0.contiguous() 2025-05-07T20:32:51.2060398Z x1 = x1.contiguous() 2025-05-07T20:32:51.2060639Z 2025-05-07T20:32:51.2060829Z if scale_ub is not None: 2025-05-07T20:32:51.2061103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2061443Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2061748Z ) 2025-05-07T20:32:51.2061946Z else: 2025-05-07T20:32:51.2062160Z scale_ub_tensor = None 2025-05-07T20:32:51.2062412Z 2025-05-07T20:32:51.2062648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2062965Z op = silu_mul_quant 2025-05-07T20:32:51.2063212Z if compiled: 2025-05-07T20:32:51.2063467Z op = torch.compile(op) 2025-05-07T20:32:51.2063783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2064055Z 2025-05-07T20:32:51.2064257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.2064425Z 2025-05-07T20:32:51.2064532Z moe/activation_test.py:117: 2025-05-07T20:32:51.2064838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2065174Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.2065464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2066171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.2066873Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.2067428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2068127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2068860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2069403Z kernel = self.compile( 2025-05-07T20:32:51.2069958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2070637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2071041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2071285Z 2025-05-07T20:32:51.2071498Z self = 2025-05-07T20:32:51.2072613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2074040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c66b380>} 2025-05-07T20:32:51.2075475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2076528Z context = 2025-05-07T20:32:51.2076867Z 2025-05-07T20:32:51.2077036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2077571Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2078079Z module_map=module_map) 2025-05-07T20:32:51.2078513Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2078872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.2079140Z E ^ 2025-05-07T20:32:51.2079616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2080085Z 2025-05-07T20:32:51.2080514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.2081050Z 2025-05-07T20:32:51.2081154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2081579Z self=, 2025-05-07T20:32:51.2081988Z T=16384, 2025-05-07T20:32:51.2082190Z D=7168, 2025-05-07T20:32:51.2082390Z scale_ub=None, 2025-05-07T20:32:51.2082602Z contiguous=True, 2025-05-07T20:32:51.2082830Z compiled=True, 2025-05-07T20:32:51.2083039Z ) 2025-05-07T20:32:51.3831440Z self = 2025-05-07T20:32:51.3832254Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3832549Z 2025-05-07T20:32:51.3832654Z @given( 2025-05-07T20:32:51.3832890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3833207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3833517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3833851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3834185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3834477Z ) 2025-05-07T20:32:51.3834827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3835285Z def test_silu_mul_quant( 2025-05-07T20:32:51.3835531Z self, 2025-05-07T20:32:51.3835725Z T: int, 2025-05-07T20:32:51.3835926Z D: int, 2025-05-07T20:32:51.3836153Z scale_ub: Optional[float], 2025-05-07T20:32:51.3836420Z contiguous: bool, 2025-05-07T20:32:51.3836663Z compiled: bool, 2025-05-07T20:32:51.3836897Z ) -> None: 2025-05-07T20:32:51.3837427Z torch.manual_seed(2025) 2025-05-07T20:32:51.3837702Z 2025-05-07T20:32:51.3837979Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3838323Z 2025-05-07T20:32:51.3838531Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3838828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3839146Z x = x_sign * x_clamp 2025-05-07T20:32:51.3839387Z x0 = x[:, :D] 2025-05-07T20:32:51.3839614Z x1 = x[:, D:] 2025-05-07T20:32:51.3839826Z 2025-05-07T20:32:51.3840011Z if contiguous: 2025-05-07T20:32:51.3840244Z x0 = x0.contiguous() 2025-05-07T20:32:51.3840504Z x1 = x1.contiguous() 2025-05-07T20:32:51.3840741Z 2025-05-07T20:32:51.3840931Z if scale_ub is not None: 2025-05-07T20:32:51.3841209Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3841542Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3841855Z ) 2025-05-07T20:32:51.3842053Z else: 2025-05-07T20:32:51.3842258Z scale_ub_tensor = None 2025-05-07T20:32:51.3842511Z 2025-05-07T20:32:51.3842829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3843145Z op = silu_mul_quant 2025-05-07T20:32:51.3843397Z if compiled: 2025-05-07T20:32:51.3843648Z op = torch.compile(op) 2025-05-07T20:32:51.3844024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3844299Z 2025-05-07T20:32:51.3844490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3844657Z 2025-05-07T20:32:51.3844759Z moe/activation_test.py:117: 2025-05-07T20:32:51.3845054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3845474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3845761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3846334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3846917Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3847600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3848357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3848902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3849608Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3850298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3850841Z kernel = self.compile( 2025-05-07T20:32:51.3851407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3852178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3852591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3852827Z 2025-05-07T20:32:51.3853044Z self = 2025-05-07T20:32:51.3854160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3855604Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e04a0>} 2025-05-07T20:32:51.3856998Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3858063Z context = 2025-05-07T20:32:51.3858407Z 2025-05-07T20:32:51.3858577Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3859115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3859596Z module_map=module_map) 2025-05-07T20:32:51.3859961Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3860329Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3860594Z E ^ 2025-05-07T20:32:51.3861071Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3861537Z 2025-05-07T20:32:51.3861966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3862510Z 2025-05-07T20:32:51.3862614Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3863046Z self=, 2025-05-07T20:32:51.3863461Z T=4096, 2025-05-07T20:32:51.3863646Z D=5120, 2025-05-07T20:32:51.3863840Z scale_ub=None, 2025-05-07T20:32:51.3864105Z contiguous=False, 2025-05-07T20:32:51.3864328Z compiled=True, 2025-05-07T20:32:51.3864535Z ) 2025-05-07T20:32:51.3864861Z self = 2025-05-07T20:32:51.3865407Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3865693Z 2025-05-07T20:32:51.3865773Z @given( 2025-05-07T20:32:51.3866008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3866322Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3866632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3867034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3867364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3867655Z ) 2025-05-07T20:32:51.3868017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3868475Z def test_silu_mul_quant( 2025-05-07T20:32:51.3868715Z self, 2025-05-07T20:32:51.3868917Z T: int, 2025-05-07T20:32:51.3869120Z D: int, 2025-05-07T20:32:51.3869337Z scale_ub: Optional[float], 2025-05-07T20:32:51.3869618Z contiguous: bool, 2025-05-07T20:32:51.3869870Z compiled: bool, 2025-05-07T20:32:51.3870092Z ) -> None: 2025-05-07T20:32:51.3870311Z torch.manual_seed(2025) 2025-05-07T20:32:51.3870558Z 2025-05-07T20:32:51.3870828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3871182Z 2025-05-07T20:32:51.3871381Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3871679Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3871998Z x = x_sign * x_clamp 2025-05-07T20:32:51.3872246Z x0 = x[:, :D] 2025-05-07T20:32:51.3872462Z x1 = x[:, D:] 2025-05-07T20:32:51.3872677Z 2025-05-07T20:32:51.3872869Z if contiguous: 2025-05-07T20:32:51.3873109Z x0 = x0.contiguous() 2025-05-07T20:32:51.3873369Z x1 = x1.contiguous() 2025-05-07T20:32:51.3873617Z 2025-05-07T20:32:51.3873811Z if scale_ub is not None: 2025-05-07T20:32:51.3874081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3874423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3874743Z ) 2025-05-07T20:32:51.3874936Z else: 2025-05-07T20:32:51.3875151Z scale_ub_tensor = None 2025-05-07T20:32:51.3875407Z 2025-05-07T20:32:51.3875636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3875960Z op = silu_mul_quant 2025-05-07T20:32:51.3876215Z if compiled: 2025-05-07T20:32:51.3876457Z op = torch.compile(op) 2025-05-07T20:32:51.3876760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3877092Z 2025-05-07T20:32:51.3877289Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3877490Z 2025-05-07T20:32:51.3877606Z moe/activation_test.py:117: 2025-05-07T20:32:51.3877907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3878244Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3878522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3879113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3879693Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3880372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3881083Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3881635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3882349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3883081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3883631Z kernel = self.compile( 2025-05-07T20:32:51.3884190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3884917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3885319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3885556Z 2025-05-07T20:32:51.3885769Z self = 2025-05-07T20:32:51.3886925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3888353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e11c0>} 2025-05-07T20:32:51.3889742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3890813Z context = 2025-05-07T20:32:51.3891116Z 2025-05-07T20:32:51.3891286Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3891877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3892371Z module_map=module_map) 2025-05-07T20:32:51.3892744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3893109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3893379Z E ^ 2025-05-07T20:32:51.3901777Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3902286Z 2025-05-07T20:32:51.3902722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3903252Z 2025-05-07T20:32:51.6981019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6981581Z self=, 2025-05-07T20:32:51.6982089Z T=4096, 2025-05-07T20:32:51.6982281Z D=5120, 2025-05-07T20:32:51.6982476Z scale_ub=1200.0, 2025-05-07T20:32:51.6982700Z contiguous=False, 2025-05-07T20:32:51.6982950Z compiled=False, 2025-05-07T20:32:51.6983165Z ) 2025-05-07T20:32:51.6983489Z self = 2025-05-07T20:32:51.6984295Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.6984594Z 2025-05-07T20:32:51.6984676Z @given( 2025-05-07T20:32:51.6984914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6985241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6985556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6985897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6986236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6986532Z ) 2025-05-07T20:32:51.6986901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6987353Z def test_silu_mul_quant( 2025-05-07T20:32:51.6987607Z self, 2025-05-07T20:32:51.6987812Z T: int, 2025-05-07T20:32:51.6988010Z D: int, 2025-05-07T20:32:51.6988237Z scale_ub: Optional[float], 2025-05-07T20:32:51.6988515Z contiguous: bool, 2025-05-07T20:32:51.6988759Z compiled: bool, 2025-05-07T20:32:51.6989013Z ) -> None: 2025-05-07T20:32:51.6989238Z torch.manual_seed(2025) 2025-05-07T20:32:51.6989494Z 2025-05-07T20:32:51.6989863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6990227Z 2025-05-07T20:32:51.6990436Z x_sign = torch.sign(x) 2025-05-07T20:32:51.6990730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.6991133Z x = x_sign * x_clamp 2025-05-07T20:32:51.6991389Z x0 = x[:, :D] 2025-05-07T20:32:51.6991612Z x1 = x[:, D:] 2025-05-07T20:32:51.6991832Z 2025-05-07T20:32:51.6992036Z if contiguous: 2025-05-07T20:32:51.6992273Z x0 = x0.contiguous() 2025-05-07T20:32:51.6992545Z x1 = x1.contiguous() 2025-05-07T20:32:51.6992875Z 2025-05-07T20:32:51.6993072Z if scale_ub is not None: 2025-05-07T20:32:51.6993357Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.6993705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.6994022Z ) 2025-05-07T20:32:51.6994228Z else: 2025-05-07T20:32:51.6994449Z scale_ub_tensor = None 2025-05-07T20:32:51.6994708Z 2025-05-07T20:32:51.6994945Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.6995272Z op = silu_mul_quant 2025-05-07T20:32:51.6995531Z if compiled: 2025-05-07T20:32:51.6995784Z op = torch.compile(op) 2025-05-07T20:32:51.6996092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.6996376Z 2025-05-07T20:32:51.6996572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.6996746Z 2025-05-07T20:32:51.6996850Z moe/activation_test.py:117: 2025-05-07T20:32:51.6997156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.6997498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.6997792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.6998515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.6999235Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.6999789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7000496Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7001187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7001735Z kernel = self.compile( 2025-05-07T20:32:51.7002297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7002980Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7003395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7003633Z 2025-05-07T20:32:51.7003903Z self = 2025-05-07T20:32:51.7005032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7006747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e2160>} 2025-05-07T20:32:51.7008160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7009233Z context = 2025-05-07T20:32:51.7009532Z 2025-05-07T20:32:51.7009704Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7010254Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7010830Z module_map=module_map) 2025-05-07T20:32:51.7011214Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7011586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7011937Z E ^ 2025-05-07T20:32:51.7012492Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7012961Z 2025-05-07T20:32:51.7013396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7013934Z 2025-05-07T20:32:51.7014102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7014536Z self=, 2025-05-07T20:32:51.7014955Z T=4096, 2025-05-07T20:32:51.7015146Z D=5120, 2025-05-07T20:32:51.7015353Z scale_ub=1200.0, 2025-05-07T20:32:51.7015590Z contiguous=False, 2025-05-07T20:32:51.7015820Z compiled=True, 2025-05-07T20:32:51.7016036Z ) 2025-05-07T20:32:51.7016370Z self = 2025-05-07T20:32:51.7016881Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.7017173Z 2025-05-07T20:32:51.7017256Z @given( 2025-05-07T20:32:51.7017496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7017814Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7018130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7018469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7018812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7019101Z ) 2025-05-07T20:32:51.7019464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7019922Z def test_silu_mul_quant( 2025-05-07T20:32:51.7020168Z self, 2025-05-07T20:32:51.7020373Z T: int, 2025-05-07T20:32:51.7020578Z D: int, 2025-05-07T20:32:51.7020804Z scale_ub: Optional[float], 2025-05-07T20:32:51.7021090Z contiguous: bool, 2025-05-07T20:32:51.7021340Z compiled: bool, 2025-05-07T20:32:51.7021567Z ) -> None: 2025-05-07T20:32:51.7021793Z torch.manual_seed(2025) 2025-05-07T20:32:51.7022050Z 2025-05-07T20:32:51.7022325Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7022679Z 2025-05-07T20:32:51.7022879Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7023175Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7023499Z x = x_sign * x_clamp 2025-05-07T20:32:51.7023749Z x0 = x[:, :D] 2025-05-07T20:32:51.7023975Z x1 = x[:, D:] 2025-05-07T20:32:51.7024184Z 2025-05-07T20:32:51.7024377Z if contiguous: 2025-05-07T20:32:51.7024688Z x0 = x0.contiguous() 2025-05-07T20:32:51.7024950Z x1 = x1.contiguous() 2025-05-07T20:32:51.7025198Z 2025-05-07T20:32:51.7025402Z if scale_ub is not None: 2025-05-07T20:32:51.7025679Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7026025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7026347Z ) 2025-05-07T20:32:51.7026549Z else: 2025-05-07T20:32:51.7026770Z scale_ub_tensor = None 2025-05-07T20:32:51.7027030Z 2025-05-07T20:32:51.7027270Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7027630Z op = silu_mul_quant 2025-05-07T20:32:51.7027907Z if compiled: 2025-05-07T20:32:51.7028159Z op = torch.compile(op) 2025-05-07T20:32:51.7028469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7028758Z 2025-05-07T20:32:51.7028955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7029132Z 2025-05-07T20:32:51.7029238Z moe/activation_test.py:117: 2025-05-07T20:32:51.7029547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7030005Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7030296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7030879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.7031527Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.7032203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.7032917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7033473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7034223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7034908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7035459Z kernel = self.compile( 2025-05-07T20:32:51.7036027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7036709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7037115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7037361Z 2025-05-07T20:32:51.7037576Z self = 2025-05-07T20:32:51.7038748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7040181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e3240>} 2025-05-07T20:32:51.7041576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7042637Z context = 2025-05-07T20:32:51.7042946Z 2025-05-07T20:32:51.7043121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7043661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7044140Z module_map=module_map) 2025-05-07T20:32:51.7044520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7044884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7045150Z E ^ 2025-05-07T20:32:51.7045682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7046157Z 2025-05-07T20:32:51.7046591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7047134Z 2025-05-07T20:32:51.8192597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8193789Z self=, 2025-05-07T20:32:51.8194684Z T=2048, 2025-05-07T20:32:51.8195061Z D=7168, 2025-05-07T20:32:51.8195461Z scale_ub=1200.0, 2025-05-07T20:32:51.8195918Z contiguous=False, 2025-05-07T20:32:51.8196377Z compiled=False, 2025-05-07T20:32:51.8196783Z ) 2025-05-07T20:32:51.8197435Z self = 2025-05-07T20:32:51.8197983Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.8198292Z 2025-05-07T20:32:51.8198375Z @given( 2025-05-07T20:32:51.8198624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8198943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8199509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8199858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8200193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8200491Z ) 2025-05-07T20:32:51.8200928Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8201383Z def test_silu_mul_quant( 2025-05-07T20:32:51.8201639Z self, 2025-05-07T20:32:51.8201847Z T: int, 2025-05-07T20:32:51.8202055Z D: int, 2025-05-07T20:32:51.8202273Z scale_ub: Optional[float], 2025-05-07T20:32:51.8202636Z contiguous: bool, 2025-05-07T20:32:51.8202883Z compiled: bool, 2025-05-07T20:32:51.8203109Z ) -> None: 2025-05-07T20:32:51.8203332Z torch.manual_seed(2025) 2025-05-07T20:32:51.8203581Z 2025-05-07T20:32:51.8203866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8204218Z 2025-05-07T20:32:51.8204415Z x_sign = torch.sign(x) 2025-05-07T20:32:51.8204710Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.8205030Z x = x_sign * x_clamp 2025-05-07T20:32:51.8205282Z x0 = x[:, :D] 2025-05-07T20:32:51.8205504Z x1 = x[:, D:] 2025-05-07T20:32:51.8205719Z 2025-05-07T20:32:51.8205913Z if contiguous: 2025-05-07T20:32:51.8206390Z x0 = x0.contiguous() 2025-05-07T20:32:51.8206665Z x1 = x1.contiguous() 2025-05-07T20:32:51.8206914Z 2025-05-07T20:32:51.8207105Z if scale_ub is not None: 2025-05-07T20:32:51.8207390Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.8207739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.8208063Z ) 2025-05-07T20:32:51.8208291Z else: 2025-05-07T20:32:51.8208530Z scale_ub_tensor = None 2025-05-07T20:32:51.8208792Z 2025-05-07T20:32:51.8209024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.8209357Z op = silu_mul_quant 2025-05-07T20:32:51.8209615Z if compiled: 2025-05-07T20:32:51.8209865Z op = torch.compile(op) 2025-05-07T20:32:51.8210175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8210462Z 2025-05-07T20:32:51.8210661Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.8210836Z 2025-05-07T20:32:51.8210941Z moe/activation_test.py:117: 2025-05-07T20:32:51.8211248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8211587Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.8211949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8212669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.8213480Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.8214035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.8214748Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.8215442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.8216000Z kernel = self.compile( 2025-05-07T20:32:51.8216556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.8217237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.8217653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8217893Z 2025-05-07T20:32:51.8218135Z self = 2025-05-07T20:32:51.8219354Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.8220792Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a4220>} 2025-05-07T20:32:51.8222251Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.8223323Z context = 2025-05-07T20:32:51.8223678Z 2025-05-07T20:32:51.8223849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.8224393Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.8224882Z module_map=module_map) 2025-05-07T20:32:51.8225261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.8225623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.8225890Z E ^ 2025-05-07T20:32:51.8226372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.8226841Z 2025-05-07T20:32:51.8227272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.8227811Z 2025-05-07T20:32:51.8227921Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8228350Z self=, 2025-05-07T20:32:51.8228772Z T=1, 2025-05-07T20:32:51.8228960Z D=7168, 2025-05-07T20:32:51.8229162Z scale_ub=None, 2025-05-07T20:32:51.8229385Z contiguous=True, 2025-05-07T20:32:51.8229613Z compiled=False, 2025-05-07T20:32:51.8229828Z ) 2025-05-07T20:32:51.8230156Z self = 2025-05-07T20:32:51.8230657Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.8230932Z 2025-05-07T20:32:51.8231013Z @given( 2025-05-07T20:32:51.8231252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8231577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8231892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8232232Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8232574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8232863Z ) 2025-05-07T20:32:51.8233228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8233685Z def test_silu_mul_quant( 2025-05-07T20:32:51.8233932Z self, 2025-05-07T20:32:51.8234136Z T: int, 2025-05-07T20:32:51.8234393Z D: int, 2025-05-07T20:32:51.8234616Z scale_ub: Optional[float], 2025-05-07T20:32:51.8234894Z contiguous: bool, 2025-05-07T20:32:51.8235144Z compiled: bool, 2025-05-07T20:32:51.8235375Z ) -> None: 2025-05-07T20:32:51.8235600Z torch.manual_seed(2025) 2025-05-07T20:32:51.8235852Z 2025-05-07T20:32:51.8236133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8236492Z 2025-05-07T20:32:51.8236697Z x_sign = torch.sign(x) 2025-05-07T20:32:51.8236993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.8237314Z x = x_sign * x_clamp 2025-05-07T20:32:51.8237568Z x0 = x[:, :D] 2025-05-07T20:32:51.8237793Z x1 = x[:, D:] 2025-05-07T20:32:51.8238008Z 2025-05-07T20:32:51.8238201Z if contiguous: 2025-05-07T20:32:51.8238443Z x0 = x0.contiguous() 2025-05-07T20:32:51.8238704Z x1 = x1.contiguous() 2025-05-07T20:32:51.8238956Z 2025-05-07T20:32:51.8239158Z if scale_ub is not None: 2025-05-07T20:32:51.8239436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.8239831Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.8240149Z ) 2025-05-07T20:32:51.8240343Z else: 2025-05-07T20:32:51.8240560Z scale_ub_tensor = None 2025-05-07T20:32:51.8240819Z 2025-05-07T20:32:51.8241096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.8241419Z op = silu_mul_quant 2025-05-07T20:32:51.8241678Z if compiled: 2025-05-07T20:32:51.8241926Z op = torch.compile(op) 2025-05-07T20:32:51.8242230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8242557Z 2025-05-07T20:32:51.8242759Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.8242928Z 2025-05-07T20:32:51.8243030Z moe/activation_test.py:117: 2025-05-07T20:32:51.8243341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8243686Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.8243970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8244687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.8245402Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.8245962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.8246666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.8247356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.8247912Z kernel = self.compile( 2025-05-07T20:32:51.8248474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.8249176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.8249594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8249836Z 2025-05-07T20:32:51.8250062Z self = 2025-05-07T20:32:51.8251201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.8252724Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a5120>} 2025-05-07T20:32:51.8254157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.8255294Z context = 2025-05-07T20:32:51.8255600Z 2025-05-07T20:32:51.8255786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.8256332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.8256825Z module_map=module_map) 2025-05-07T20:32:51.8257208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.8257599Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.8257895Z E ^ 2025-05-07T20:32:51.8258385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.8258867Z 2025-05-07T20:32:51.8259319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.8259864Z 2025-05-07T20:32:51.8259976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8260422Z self=, 2025-05-07T20:32:51.8260849Z T=16384, 2025-05-07T20:32:51.8261124Z D=7168, 2025-05-07T20:32:51.8261331Z scale_ub=1200.0, 2025-05-07T20:32:51.8261568Z contiguous=False, 2025-05-07T20:32:51.8261798Z compiled=True, 2025-05-07T20:32:52.0663208Z ) 2025-05-07T20:32:52.0664100Z self = 2025-05-07T20:32:52.0664804Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.0665097Z 2025-05-07T20:32:52.0665179Z @given( 2025-05-07T20:32:52.0665422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.0665854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.0666166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.0666495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.0666839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.0667132Z ) 2025-05-07T20:32:52.0667485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.0667949Z def test_silu_mul_quant( 2025-05-07T20:32:52.0668249Z self, 2025-05-07T20:32:52.0668450Z T: int, 2025-05-07T20:32:52.0668658Z D: int, 2025-05-07T20:32:52.0668891Z scale_ub: Optional[float], 2025-05-07T20:32:52.0669172Z contiguous: bool, 2025-05-07T20:32:52.0669426Z compiled: bool, 2025-05-07T20:32:52.0669665Z ) -> None: 2025-05-07T20:32:52.0676937Z torch.manual_seed(2025) 2025-05-07T20:32:52.0677225Z 2025-05-07T20:32:52.0677512Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.0677871Z 2025-05-07T20:32:52.0678070Z x_sign = torch.sign(x) 2025-05-07T20:32:52.0678419Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.0678739Z x = x_sign * x_clamp 2025-05-07T20:32:52.0678982Z x0 = x[:, :D] 2025-05-07T20:32:52.0679208Z x1 = x[:, D:] 2025-05-07T20:32:52.0679425Z 2025-05-07T20:32:52.0679614Z if contiguous: 2025-05-07T20:32:52.0679864Z x0 = x0.contiguous() 2025-05-07T20:32:52.0680132Z x1 = x1.contiguous() 2025-05-07T20:32:52.0680375Z 2025-05-07T20:32:52.0680563Z if scale_ub is not None: 2025-05-07T20:32:52.0680834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.0681185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.0681502Z ) 2025-05-07T20:32:52.0681703Z else: 2025-05-07T20:32:52.0681922Z scale_ub_tensor = None 2025-05-07T20:32:52.0682174Z 2025-05-07T20:32:52.0682418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.0682745Z op = silu_mul_quant 2025-05-07T20:32:52.0682996Z if compiled: 2025-05-07T20:32:52.0683249Z op = torch.compile(op) 2025-05-07T20:32:52.0683686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0683967Z 2025-05-07T20:32:52.0684166Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.0684342Z 2025-05-07T20:32:52.0684445Z moe/activation_test.py:117: 2025-05-07T20:32:52.0684753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0685088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.0685381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0685961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.0686535Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.0687215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.0687976Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.0688532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.0689234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.0690005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.0690564Z kernel = self.compile( 2025-05-07T20:32:52.0691120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.0691917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.0692333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0692572Z 2025-05-07T20:32:52.0692792Z self = 2025-05-07T20:32:52.0693958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.0695396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a6520>} 2025-05-07T20:32:52.0696792Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.0697862Z context = 2025-05-07T20:32:52.0698160Z 2025-05-07T20:32:52.0698338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.0698874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.0699361Z module_map=module_map) 2025-05-07T20:32:52.0699741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.0700101Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.0700372Z E ^ 2025-05-07T20:32:52.0700856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.0701326Z 2025-05-07T20:32:52.0701763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.0702298Z 2025-05-07T20:32:52.0702409Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.0702841Z self=, 2025-05-07T20:32:52.0703260Z T=1, 2025-05-07T20:32:52.0703448Z D=7168, 2025-05-07T20:32:52.0703656Z scale_ub=None, 2025-05-07T20:32:52.0703882Z contiguous=False, 2025-05-07T20:32:52.0704114Z compiled=False, 2025-05-07T20:32:52.0704334Z ) 2025-05-07T20:32:52.0704715Z self = 2025-05-07T20:32:52.0705228Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.0705501Z 2025-05-07T20:32:52.0705583Z @given( 2025-05-07T20:32:52.0705824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.0706607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.0706927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.0707273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.0707618Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.0707909Z ) 2025-05-07T20:32:52.0708268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.0708723Z def test_silu_mul_quant( 2025-05-07T20:32:52.0708977Z self, 2025-05-07T20:32:52.0709173Z T: int, 2025-05-07T20:32:52.0709378Z D: int, 2025-05-07T20:32:52.0709602Z scale_ub: Optional[float], 2025-05-07T20:32:52.0709879Z contiguous: bool, 2025-05-07T20:32:52.0710130Z compiled: bool, 2025-05-07T20:32:52.0710364Z ) -> None: 2025-05-07T20:32:52.0710582Z torch.manual_seed(2025) 2025-05-07T20:32:52.0710911Z 2025-05-07T20:32:52.0711193Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.0711541Z 2025-05-07T20:32:52.0711748Z x_sign = torch.sign(x) 2025-05-07T20:32:52.0712111Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.0712432Z x = x_sign * x_clamp 2025-05-07T20:32:52.0712683Z x0 = x[:, :D] 2025-05-07T20:32:52.0712909Z x1 = x[:, D:] 2025-05-07T20:32:52.0713116Z 2025-05-07T20:32:52.0713309Z if contiguous: 2025-05-07T20:32:52.0713550Z x0 = x0.contiguous() 2025-05-07T20:32:52.0713899Z x1 = x1.contiguous() 2025-05-07T20:32:52.0714145Z 2025-05-07T20:32:52.0714344Z if scale_ub is not None: 2025-05-07T20:32:52.0714625Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.0714969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.0715289Z ) 2025-05-07T20:32:52.0715490Z else: 2025-05-07T20:32:52.0715709Z scale_ub_tensor = None 2025-05-07T20:32:52.0715972Z 2025-05-07T20:32:52.0716216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.0716539Z op = silu_mul_quant 2025-05-07T20:32:52.0716803Z if compiled: 2025-05-07T20:32:52.0717060Z op = torch.compile(op) 2025-05-07T20:32:52.0717362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0717648Z 2025-05-07T20:32:52.0717851Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.0718020Z 2025-05-07T20:32:52.0718123Z moe/activation_test.py:117: 2025-05-07T20:32:52.0718437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0718785Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.0719079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0719790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.0720510Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.0721069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.0721776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.0722471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.0723025Z kernel = self.compile( 2025-05-07T20:32:52.0723584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.0724261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.0724759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0725000Z 2025-05-07T20:32:52.0725218Z self = 2025-05-07T20:32:52.0726348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.0727771Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a7100>} 2025-05-07T20:32:52.0729169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.0730238Z context = 2025-05-07T20:32:52.0730539Z 2025-05-07T20:32:52.0730719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.0731298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.0731785Z module_map=module_map) 2025-05-07T20:32:52.0732262Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.0732628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.0732934Z E ^ 2025-05-07T20:32:52.0733415Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.0733883Z 2025-05-07T20:32:52.0734320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.0734893Z 2025-05-07T20:32:52.0735005Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.0735427Z self=, 2025-05-07T20:32:52.0735850Z T=2048, 2025-05-07T20:32:52.0736043Z D=7168, 2025-05-07T20:32:52.0736238Z scale_ub=None, 2025-05-07T20:32:52.0736462Z contiguous=False, 2025-05-07T20:32:52.0736701Z compiled=True, 2025-05-07T20:32:52.0736907Z ) 2025-05-07T20:32:52.1601255Z self = 2025-05-07T20:32:52.1602036Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.1602447Z 2025-05-07T20:32:52.1602531Z @given( 2025-05-07T20:32:52.1602765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1603077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1603397Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1603741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1604080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1604366Z ) 2025-05-07T20:32:52.1604732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1605194Z def test_silu_mul_quant( 2025-05-07T20:32:52.1605441Z self, 2025-05-07T20:32:52.1605645Z T: int, 2025-05-07T20:32:52.1605858Z D: int, 2025-05-07T20:32:52.1606078Z scale_ub: Optional[float], 2025-05-07T20:32:52.1606611Z contiguous: bool, 2025-05-07T20:32:52.1606863Z compiled: bool, 2025-05-07T20:32:52.1607100Z ) -> None: 2025-05-07T20:32:52.1607327Z torch.manual_seed(2025) 2025-05-07T20:32:52.1607605Z 2025-05-07T20:32:52.1607903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1608262Z 2025-05-07T20:32:52.1608462Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1608759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1609078Z x = x_sign * x_clamp 2025-05-07T20:32:52.1609324Z x0 = x[:, :D] 2025-05-07T20:32:52.1609546Z x1 = x[:, D:] 2025-05-07T20:32:52.1609752Z 2025-05-07T20:32:52.1610238Z if contiguous: 2025-05-07T20:32:52.1610480Z x0 = x0.contiguous() 2025-05-07T20:32:52.1610742Z x1 = x1.contiguous() 2025-05-07T20:32:52.1610993Z 2025-05-07T20:32:52.1611190Z if scale_ub is not None: 2025-05-07T20:32:52.1611464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1611883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1612207Z ) 2025-05-07T20:32:52.1612404Z else: 2025-05-07T20:32:52.1612621Z scale_ub_tensor = None 2025-05-07T20:32:52.1612882Z 2025-05-07T20:32:52.1613117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1613441Z op = silu_mul_quant 2025-05-07T20:32:52.1613701Z if compiled: 2025-05-07T20:32:52.1613949Z op = torch.compile(op) 2025-05-07T20:32:52.1614265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1614550Z 2025-05-07T20:32:52.1614748Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1614923Z 2025-05-07T20:32:52.1615026Z moe/activation_test.py:117: 2025-05-07T20:32:52.1615422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1615769Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1616055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1616637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1617288Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.1617971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1618689Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1619318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1620034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1620718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1621274Z kernel = self.compile( 2025-05-07T20:32:52.1621837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1622512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1622929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1623176Z 2025-05-07T20:32:52.1623391Z self = 2025-05-07T20:32:52.1624516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1625964Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d0720>} 2025-05-07T20:32:52.1627353Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1628469Z context = 2025-05-07T20:32:52.1628774Z 2025-05-07T20:32:52.1628945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1629485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1629965Z module_map=module_map) 2025-05-07T20:32:52.1630338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1630707Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1630969Z E ^ 2025-05-07T20:32:52.1631493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1631967Z 2025-05-07T20:32:52.1632401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1632933Z 2025-05-07T20:32:52.1633045Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1633472Z self=, 2025-05-07T20:32:52.1633892Z T=4096, 2025-05-07T20:32:52.1634089Z D=7168, 2025-05-07T20:32:52.1634286Z scale_ub=None, 2025-05-07T20:32:52.1634507Z contiguous=False, 2025-05-07T20:32:52.1634741Z compiled=True, 2025-05-07T20:32:52.1634951Z ) 2025-05-07T20:32:52.1635280Z self = 2025-05-07T20:32:52.1635797Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.1636080Z 2025-05-07T20:32:52.1636174Z @given( 2025-05-07T20:32:52.1636408Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1636783Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1637105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1637437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1637781Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1638128Z ) 2025-05-07T20:32:52.1638532Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1638995Z def test_silu_mul_quant( 2025-05-07T20:32:52.1639250Z self, 2025-05-07T20:32:52.1639455Z T: int, 2025-05-07T20:32:52.1639659Z D: int, 2025-05-07T20:32:52.1639929Z scale_ub: Optional[float], 2025-05-07T20:32:52.1640213Z contiguous: bool, 2025-05-07T20:32:52.1640455Z compiled: bool, 2025-05-07T20:32:52.1640692Z ) -> None: 2025-05-07T20:32:52.1640919Z torch.manual_seed(2025) 2025-05-07T20:32:52.1641164Z 2025-05-07T20:32:52.1641444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1641799Z 2025-05-07T20:32:52.1641995Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1642300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1642623Z x = x_sign * x_clamp 2025-05-07T20:32:52.1642873Z x0 = x[:, :D] 2025-05-07T20:32:52.1643099Z x1 = x[:, D:] 2025-05-07T20:32:52.1643312Z 2025-05-07T20:32:52.1643499Z if contiguous: 2025-05-07T20:32:52.1643738Z x0 = x0.contiguous() 2025-05-07T20:32:52.1644010Z x1 = x1.contiguous() 2025-05-07T20:32:52.1644251Z 2025-05-07T20:32:52.1644455Z if scale_ub is not None: 2025-05-07T20:32:52.1644734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1645076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1645394Z ) 2025-05-07T20:32:52.1645599Z else: 2025-05-07T20:32:52.1645821Z scale_ub_tensor = None 2025-05-07T20:32:52.1646072Z 2025-05-07T20:32:52.1646314Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1646639Z op = silu_mul_quant 2025-05-07T20:32:52.1646891Z if compiled: 2025-05-07T20:32:52.1647148Z op = torch.compile(op) 2025-05-07T20:32:52.1647457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1647734Z 2025-05-07T20:32:52.1647933Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1648103Z 2025-05-07T20:32:52.1648209Z moe/activation_test.py:117: 2025-05-07T20:32:52.1648506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1648852Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1649142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1649769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1650343Z return fn(*args, **kwargs) 
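Every example fails at the same point: Triton rejects the fp8e4nv (FP8 E4M3) element type while compiling _fbgemm_silu_mul_quant for this GPU. The job runs on a g5.4xlarge, whose NVIDIA A10G reports compute capability 8.6, and Triton's fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper); on sm_86 only fp8e4b15 and fp8e5 are accepted, exactly as the ValueError says. A minimal guard sketch, assuming only the public torch.cuda API (the helper name supports_fp8e4nv is hypothetical):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) Triton kernels generally need sm_89+; the A10G on this
    # runner is sm_86, which is why compilation raises the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# A test could then opt out cleanly instead of failing at compile time:
# @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 requires sm_89+")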
2025-05-07T20:32:52.1651026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1651736Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1652350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1653060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1653748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1654302Z kernel = self.compile( 2025-05-07T20:32:52.1654863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1655549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1655970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1656209Z 2025-05-07T20:32:52.1656484Z self = 2025-05-07T20:32:52.1657602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1659065Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d1440>} 2025-05-07T20:32:52.1660462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1661580Z context = 2025-05-07T20:32:52.1661878Z 2025-05-07T20:32:52.1662051Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1662596Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1663083Z module_map=module_map) 2025-05-07T20:32:52.1663460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1663824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1664097Z E ^ 2025-05-07T20:32:52.1664580Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1665047Z 2025-05-07T20:32:52.1665487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1666022Z 2025-05-07T20:32:52.3253995Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3254689Z self=, 2025-05-07T20:32:52.3255261Z T=16384, 2025-05-07T20:32:52.3255560Z D=5120, 2025-05-07T20:32:52.3255755Z scale_ub=1200.0, 2025-05-07T20:32:52.3255983Z contiguous=False, 2025-05-07T20:32:52.3256207Z compiled=False, 2025-05-07T20:32:52.3256421Z ) 2025-05-07T20:32:52.3256747Z self = 2025-05-07T20:32:52.3257282Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3257573Z 2025-05-07T20:32:52.3257659Z @given( 2025-05-07T20:32:52.3257890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3258219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3258541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3258881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3259217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3259513Z ) 2025-05-07T20:32:52.3260165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3260623Z def test_silu_mul_quant( 2025-05-07T20:32:52.3260878Z self, 2025-05-07T20:32:52.3261082Z T: int, 2025-05-07T20:32:52.3261282Z D: int, 2025-05-07T20:32:52.3261512Z scale_ub: Optional[float], 2025-05-07T20:32:52.3261792Z contiguous: bool, 2025-05-07T20:32:52.3262038Z compiled: bool, 2025-05-07T20:32:52.3262275Z ) -> None: 2025-05-07T20:32:52.3262499Z torch.manual_seed(2025) 2025-05-07T20:32:52.3262745Z 2025-05-07T20:32:52.3263029Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3263386Z 2025-05-07T20:32:52.3263592Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3263887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3264209Z x = x_sign * x_clamp 2025-05-07T20:32:52.3264459Z x0 = x[:, :D] 2025-05-07T20:32:52.3264686Z x1 = x[:, D:] 2025-05-07T20:32:52.3264903Z 2025-05-07T20:32:52.3265102Z if contiguous: 2025-05-07T20:32:52.3265338Z x0 = x0.contiguous() 2025-05-07T20:32:52.3265706Z x1 = x1.contiguous() 2025-05-07T20:32:52.3265963Z 2025-05-07T20:32:52.3266157Z if scale_ub is not None: 2025-05-07T20:32:52.3266439Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3266864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3267177Z ) 2025-05-07T20:32:52.3267379Z else: 2025-05-07T20:32:52.3267598Z scale_ub_tensor = None 2025-05-07T20:32:52.3267854Z 2025-05-07T20:32:52.3268092Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3268520Z op = silu_mul_quant 2025-05-07T20:32:52.3268780Z if compiled: 2025-05-07T20:32:52.3269029Z op = torch.compile(op) 2025-05-07T20:32:52.3269337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3269622Z 2025-05-07T20:32:52.3269816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3269992Z 2025-05-07T20:32:52.3270094Z moe/activation_test.py:117: 2025-05-07T20:32:52.3270403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3270745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3271034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3271755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3272470Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3273019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3273731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3274424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3274975Z kernel = self.compile( 2025-05-07T20:32:52.3275538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3276219Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3276633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3276874Z 2025-05-07T20:32:52.3277087Z self = 2025-05-07T20:32:52.3278222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3279750Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d2340>} 2025-05-07T20:32:52.3281182Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3282266Z context = 2025-05-07T20:32:52.3282574Z 2025-05-07T20:32:52.3282750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3283300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3290647Z module_map=module_map) 2025-05-07T20:32:52.3291251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3291630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3291941Z E ^ 2025-05-07T20:32:52.3292429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3292895Z 2025-05-07T20:32:52.3293412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3293958Z 2025-05-07T20:32:52.3294065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3294495Z self=, 2025-05-07T20:32:52.3294963Z T=16384, 2025-05-07T20:32:52.3295159Z D=5120, 2025-05-07T20:32:52.3295359Z scale_ub=1200.0, 2025-05-07T20:32:52.3295597Z contiguous=True, 2025-05-07T20:32:52.3295820Z compiled=True, 2025-05-07T20:32:52.3296031Z ) 2025-05-07T20:32:52.3296367Z self = 2025-05-07T20:32:52.3296928Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.3297228Z 2025-05-07T20:32:52.3297307Z @given( 2025-05-07T20:32:52.3297548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3297867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3298183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3298528Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3298866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3299153Z ) 2025-05-07T20:32:52.3299512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3299972Z def test_silu_mul_quant( 2025-05-07T20:32:52.3300216Z self, 2025-05-07T20:32:52.3300415Z T: int, 2025-05-07T20:32:52.3300616Z D: int, 2025-05-07T20:32:52.3300835Z scale_ub: Optional[float], 2025-05-07T20:32:52.3301113Z contiguous: bool, 2025-05-07T20:32:52.3301362Z compiled: bool, 2025-05-07T20:32:52.3301588Z ) -> None: 2025-05-07T20:32:52.3301816Z torch.manual_seed(2025) 2025-05-07T20:32:52.3302067Z 2025-05-07T20:32:52.3302345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3302695Z 2025-05-07T20:32:52.3302892Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3303190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3303508Z x = x_sign * x_clamp 2025-05-07T20:32:52.3303756Z x0 = x[:, :D] 2025-05-07T20:32:52.3303972Z x1 = x[:, D:] 2025-05-07T20:32:52.3304190Z 2025-05-07T20:32:52.3304383Z if contiguous: 2025-05-07T20:32:52.3304619Z x0 = x0.contiguous() 2025-05-07T20:32:52.3304877Z x1 = x1.contiguous() 2025-05-07T20:32:52.3305120Z 2025-05-07T20:32:52.3305316Z if scale_ub is not None: 2025-05-07T20:32:52.3305596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3305946Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3306579Z ) 2025-05-07T20:32:52.3306774Z else: 2025-05-07T20:32:52.3306993Z scale_ub_tensor = None 2025-05-07T20:32:52.3307340Z 2025-05-07T20:32:52.3307575Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3307896Z op = silu_mul_quant 2025-05-07T20:32:52.3308177Z if compiled: 2025-05-07T20:32:52.3308523Z op = torch.compile(op) 2025-05-07T20:32:52.3308868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3309151Z 2025-05-07T20:32:52.3309347Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3309522Z 2025-05-07T20:32:52.3309624Z moe/activation_test.py:117: 2025-05-07T20:32:52.3309934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3310279Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3310564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3311146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3311731Z return fn(*args, **kwargs) 
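The decorators in the listing above are Hypothesis property-based testing: each st.sampled_from strategy draws one value per trial, so every "Trying example:" record in this log is a single draw of (T, D, scale_ub, contiguous, compiled), capped by max_examples. A self-contained sketch of the same pattern, assuming nothing about the FBGEMM test module (_MAX_SAMPLES is defined elsewhere in it; 16 is a stand-in here):

from hypothesis import Verbosity, given, settings, strategies as st

@given(
    t=st.sampled_from([1, 128, 2048]),
    d=st.sampled_from([5120, 7168]),
)
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def check_shapes(t: int, d: int) -> None:
    # Verbose mode prints one "Trying example:" line per draw, as in the log.
    assert t > 0 and d > 0

check_shapes()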
2025-05-07T20:32:52.3312408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3313125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3313766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3314473Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3315227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3315780Z kernel = self.compile( 2025-05-07T20:32:52.3316343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3317085Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3317494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3317738Z 2025-05-07T20:32:52.3317956Z self = 2025-05-07T20:32:52.3319078Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3320503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d39c0>} 2025-05-07T20:32:52.3321892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3322958Z context = 2025-05-07T20:32:52.3323262Z 2025-05-07T20:32:52.3323433Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3323978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3324460Z module_map=module_map) 2025-05-07T20:32:52.3324838Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3325204Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3325467Z E ^ 2025-05-07T20:32:52.3325951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3326422Z 2025-05-07T20:32:52.3326852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3327388Z 2025-05-07T20:32:52.5025540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5026123Z self=, 2025-05-07T20:32:52.5026546Z T=16384, 2025-05-07T20:32:52.5026991Z D=5120, 2025-05-07T20:32:52.5027219Z scale_ub=None, 2025-05-07T20:32:52.5027441Z contiguous=False, 2025-05-07T20:32:52.5027675Z compiled=True, 2025-05-07T20:32:52.5027937Z ) 2025-05-07T20:32:52.5028287Z self = 2025-05-07T20:32:52.5028807Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.5029108Z 2025-05-07T20:32:52.5029192Z @given( 2025-05-07T20:32:52.5029434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5029759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5030073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5030416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5030763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5031057Z ) 2025-05-07T20:32:52.5031425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5031891Z def test_silu_mul_quant( 2025-05-07T20:32:52.5032140Z self, 2025-05-07T20:32:52.5032345Z T: int, 2025-05-07T20:32:52.5032552Z D: int, 2025-05-07T20:32:52.5032859Z scale_ub: Optional[float], 2025-05-07T20:32:52.5033145Z contiguous: bool, 2025-05-07T20:32:52.5033399Z compiled: bool, 2025-05-07T20:32:52.5033631Z ) -> None: 2025-05-07T20:32:52.5033930Z torch.manual_seed(2025) 2025-05-07T20:32:52.5034184Z 2025-05-07T20:32:52.5034462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5034819Z 2025-05-07T20:32:52.5035024Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5035322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5035713Z x = x_sign * x_clamp 2025-05-07T20:32:52.5035962Z x0 = x[:, :D] 2025-05-07T20:32:52.5036187Z x1 = x[:, D:] 2025-05-07T20:32:52.5036397Z 2025-05-07T20:32:52.5036594Z if contiguous: 2025-05-07T20:32:52.5036840Z x0 = x0.contiguous() 2025-05-07T20:32:52.5037103Z x1 = x1.contiguous() 2025-05-07T20:32:52.5037352Z 2025-05-07T20:32:52.5037553Z if scale_ub is not None: 2025-05-07T20:32:52.5037829Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5038223Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5038546Z ) 2025-05-07T20:32:52.5038745Z else: 2025-05-07T20:32:52.5038966Z scale_ub_tensor = None 2025-05-07T20:32:52.5039226Z 2025-05-07T20:32:52.5039462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5039791Z op = silu_mul_quant 2025-05-07T20:32:52.5040052Z if compiled: 2025-05-07T20:32:52.5040313Z op = torch.compile(op) 2025-05-07T20:32:52.5040611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5040898Z 2025-05-07T20:32:52.5041105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.5041274Z 2025-05-07T20:32:52.5041379Z moe/activation_test.py:117: 2025-05-07T20:32:52.5041688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5042039Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.5042329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5042919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.5043510Z return fn(*args, **kwargs) 
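For reference, a plain-PyTorch sketch of what the test body expects silu_mul_quant to produce, inferred from the test alone (the kernel source is not in this log, so the rowwise max-abs scaling, the FP8_MAX constant, and the returned scale shape are assumptions):

import torch

FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # silu(x0) * x1 in higher precision, then quantize each row to FP8 E4M3.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        # scale_ub arrives as a 1-element float32 tensor, as in the test.
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)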
2025-05-07T20:32:52.5044196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.5044909Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.5045468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5046190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5046933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5047496Z kernel = self.compile( 2025-05-07T20:32:52.5048069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5048756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5049172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5049420Z 2025-05-07T20:32:52.5049640Z self = 2025-05-07T20:32:52.5050772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5052303Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce40c20>} 2025-05-07T20:32:52.5053753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5054842Z context = 2025-05-07T20:32:52.5055222Z 2025-05-07T20:32:52.5055396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.5055952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.5056438Z module_map=module_map) 2025-05-07T20:32:52.5056860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.5057232Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.5057503Z E ^ 2025-05-07T20:32:52.5057994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5058478Z 2025-05-07T20:32:52.5058922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.5059467Z 2025-05-07T20:32:52.5059580Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5060024Z self=, 2025-05-07T20:32:52.5060447Z T=2048, 2025-05-07T20:32:52.5060643Z D=5120, 2025-05-07T20:32:52.5060842Z scale_ub=None, 2025-05-07T20:32:52.5061062Z contiguous=False, 2025-05-07T20:32:52.5061297Z compiled=True, 2025-05-07T20:32:52.5061519Z ) 2025-05-07T20:32:52.5967326Z self = 2025-05-07T20:32:52.5968132Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.5968737Z 2025-05-07T20:32:52.5968899Z @given( 2025-05-07T20:32:52.5969386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5970035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5970671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5971354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5972132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5972718Z ) 2025-05-07T20:32:52.5973446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5974367Z def test_silu_mul_quant( 2025-05-07T20:32:52.5974856Z self, 2025-05-07T20:32:52.5975259Z T: int, 2025-05-07T20:32:52.5975664Z D: int, 2025-05-07T20:32:52.5976106Z scale_ub: Optional[float], 2025-05-07T20:32:52.5976658Z contiguous: bool, 2025-05-07T20:32:52.5977151Z compiled: bool, 2025-05-07T20:32:52.5977611Z ) -> None: 2025-05-07T20:32:52.5978046Z torch.manual_seed(2025) 2025-05-07T20:32:52.5978665Z 2025-05-07T20:32:52.5978953Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5979304Z 2025-05-07T20:32:52.5979505Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5979811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5980128Z x = x_sign * x_clamp 2025-05-07T20:32:52.5980376Z x0 = x[:, :D] 2025-05-07T20:32:52.5980606Z x1 = x[:, D:] 2025-05-07T20:32:52.5980816Z 2025-05-07T20:32:52.5981008Z if contiguous: 2025-05-07T20:32:52.5981247Z x0 = x0.contiguous() 2025-05-07T20:32:52.5981510Z x1 = x1.contiguous() 2025-05-07T20:32:52.5981758Z 2025-05-07T20:32:52.5981956Z if scale_ub is not None: 2025-05-07T20:32:52.5982234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5982581Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5982905Z ) 2025-05-07T20:32:52.5983104Z else: 2025-05-07T20:32:52.5983322Z scale_ub_tensor = None 2025-05-07T20:32:52.5983587Z 2025-05-07T20:32:52.5983826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5984235Z op = silu_mul_quant 2025-05-07T20:32:52.5984505Z if compiled: 2025-05-07T20:32:52.5984765Z op = torch.compile(op) 2025-05-07T20:32:52.5985067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5985424Z 2025-05-07T20:32:52.5985626Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.5985797Z 2025-05-07T20:32:52.5985900Z moe/activation_test.py:117: 2025-05-07T20:32:52.5986209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5986554Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.5986915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5987497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.5988143Z return fn(*args, **kwargs) 
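The jit.py frames that follow show why the failure surfaces mid-test rather than at import: indexing a @triton.jit function with a grid returns a launcher, and the kernel is only compiled on its first call (jit.py run -> self.compile -> make_ir), so an unsupported dtype raises at launch time. A toy sketch of that dispatch, assuming only the public Triton API (requires a CUDA device):

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

src = torch.arange(1024, device="cuda", dtype=torch.float32)
dst = torch.empty_like(src)
grid = (triton.cdiv(src.numel(), 256),)
_copy_kernel[grid](src, dst, src.numel(), BLOCK=256)  # compiled here, on first launch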
2025-05-07T20:32:52.5988839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.5989559Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.5990124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5990846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5991548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5992104Z kernel = self.compile( 2025-05-07T20:32:52.5992672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5993366Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5993785Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5994030Z 2025-05-07T20:32:52.5994247Z self = 2025-05-07T20:32:52.5995393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5996867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce419e0>} 2025-05-07T20:32:52.5998298Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5999428Z context = 2025-05-07T20:32:52.5999738Z 2025-05-07T20:32:52.5999961Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6000516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6001009Z module_map=module_map) 2025-05-07T20:32:52.6001381Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6001750Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6002024Z E ^ 2025-05-07T20:32:52.6002505Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[... identical test listing and CompilationError traceback elided for the following examples ...]
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0559355Z 2025-05-07T20:32:53.0559788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.0560334Z 2025-05-07T20:32:53.1786240Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1786700Z self=, 2025-05-07T20:32:53.1787161Z T=16384, 2025-05-07T20:32:53.1787446Z D=5120, 2025-05-07T20:32:53.1787689Z scale_ub=1200.0, 2025-05-07T20:32:53.1787916Z contiguous=True, 2025-05-07T20:32:53.1788143Z compiled=False, 2025-05-07T20:32:53.1788384Z ) 2025-05-07T20:32:53.1788735Z self = 2025-05-07T20:32:53.1789265Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.1789556Z 2025-05-07T20:32:53.1789643Z @given( 2025-05-07T20:32:53.1789878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1790209Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1790529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1790872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1791467Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1791768Z ) 2025-05-07T20:32:53.1792131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1792581Z def test_silu_mul_quant( 2025-05-07T20:32:53.1792832Z self, 2025-05-07T20:32:53.1793037Z T: int, 2025-05-07T20:32:53.1793237Z D: int, 2025-05-07T20:32:53.1793465Z scale_ub: Optional[float], 2025-05-07T20:32:53.1793744Z contiguous: bool, 2025-05-07T20:32:53.1794081Z compiled: bool, 2025-05-07T20:32:53.1794317Z ) -> None: 2025-05-07T20:32:53.1794538Z torch.manual_seed(2025) 2025-05-07T20:32:53.1794781Z 2025-05-07T20:32:53.1795062Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1795419Z 2025-05-07T20:32:53.1795618Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1795910Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1796233Z x = x_sign * x_clamp 2025-05-07T20:32:53.1796481Z x0 = x[:, :D] 2025-05-07T20:32:53.1796698Z x1 = x[:, D:] 2025-05-07T20:32:53.1796912Z 2025-05-07T20:32:53.1797188Z if contiguous: 2025-05-07T20:32:53.1797423Z x0 = x0.contiguous() 2025-05-07T20:32:53.1797690Z x1 = x1.contiguous() 2025-05-07T20:32:53.1797937Z 2025-05-07T20:32:53.1798130Z if scale_ub is not None: 2025-05-07T20:32:53.1798413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1798762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1799074Z ) 2025-05-07T20:32:53.1799274Z else: 2025-05-07T20:32:53.1799492Z scale_ub_tensor = None 2025-05-07T20:32:53.1799822Z 2025-05-07T20:32:53.1800060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1800386Z op = silu_mul_quant 2025-05-07T20:32:53.1800646Z if compiled: 2025-05-07T20:32:53.1800898Z op = torch.compile(op) 2025-05-07T20:32:53.1801202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1801485Z 2025-05-07T20:32:53.1801678Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1801856Z 2025-05-07T20:32:53.1801960Z moe/activation_test.py:117: 2025-05-07T20:32:53.1802272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1802610Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1802902Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1803621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:53.1804337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.1804890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1805597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1806558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1807110Z kernel = self.compile( 2025-05-07T20:32:53.1807672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1808356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1808771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1809009Z 2025-05-07T20:32:53.1809224Z self = 2025-05-07T20:32:53.1810345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1811951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871f59a80>} 2025-05-07T20:32:53.1813639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1814903Z context = 2025-05-07T20:32:53.1815243Z 2025-05-07T20:32:53.1815428Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1816073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1816556Z module_map=module_map) 2025-05-07T20:32:53.1816929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1817292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.1817562Z E ^ 2025-05-07T20:32:53.1818091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1818564Z 2025-05-07T20:32:53.1819054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1819595Z 2025-05-07T20:32:53.1819701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1820131Z self=, 2025-05-07T20:32:53.1820546Z T=1, 2025-05-07T20:32:53.1820738Z D=7168, 2025-05-07T20:32:53.1820940Z scale_ub=1200.0, 2025-05-07T20:32:53.1821167Z contiguous=False, 2025-05-07T20:32:53.1821404Z compiled=False, 2025-05-07T20:32:53.1821619Z ) 2025-05-07T20:32:53.1822022Z self = 2025-05-07T20:32:53.1822526Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.1822808Z 2025-05-07T20:32:53.1822889Z @given( 2025-05-07T20:32:53.1823128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1823447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1823768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1824106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1824437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1824736Z ) 2025-05-07T20:32:53.1825098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1825560Z def test_silu_mul_quant( 2025-05-07T20:32:53.1825807Z self, 2025-05-07T20:32:53.1826012Z T: int, 2025-05-07T20:32:53.1826222Z D: int, 2025-05-07T20:32:53.1826448Z scale_ub: Optional[float], 2025-05-07T20:32:53.1826732Z contiguous: bool, 2025-05-07T20:32:53.1826985Z compiled: bool, 2025-05-07T20:32:53.1827211Z ) -> None: 2025-05-07T20:32:53.1827437Z torch.manual_seed(2025) 2025-05-07T20:32:53.1827696Z 2025-05-07T20:32:53.1827971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1828329Z 2025-05-07T20:32:53.1828535Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1828830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1829151Z x = x_sign * x_clamp 2025-05-07T20:32:53.1829401Z x0 = x[:, :D] 2025-05-07T20:32:53.1829618Z x1 = x[:, D:] 2025-05-07T20:32:53.1829835Z 2025-05-07T20:32:53.1830032Z if contiguous: 2025-05-07T20:32:53.1830264Z x0 = x0.contiguous() 2025-05-07T20:32:53.1830533Z x1 = x1.contiguous() 2025-05-07T20:32:53.1830780Z 2025-05-07T20:32:53.1830977Z if scale_ub is not None: 2025-05-07T20:32:53.1831253Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1831598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1831915Z ) 2025-05-07T20:32:53.1832107Z else: 2025-05-07T20:32:53.1832378Z scale_ub_tensor = None 2025-05-07T20:32:53.1832639Z 2025-05-07T20:32:53.1832874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1833202Z op = silu_mul_quant 2025-05-07T20:32:53.1833459Z if compiled: 2025-05-07T20:32:53.1833705Z op = torch.compile(op) 2025-05-07T20:32:53.1834010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1834295Z 2025-05-07T20:32:53.1834487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1834727Z 2025-05-07T20:32:53.1834829Z moe/activation_test.py:117: 2025-05-07T20:32:53.1835133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1835478Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1835766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1836484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.1837196Z 
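Every fp8 example fails at the same point: Triton's NVIDIA backend only lowers the fp8e4nv (FP8 E4M3) element type on GPUs with compute capability 8.9 or newer (Ada, Hopper), and the A10G behind linux.g5.4xlarge reports SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A minimal capability guard, as a sketch (the helper and its placement are illustrative, not the repository's actual gating):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (E4M3) only on SM 8.9+ (Ada, Hopper).
    # An A10G reports (8, 6), so this returns False on this runner.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Illustrative placement; the real tests live in moe/activation_test.py.
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class SiluMulQuantTests(unittest.TestCase):
    ...

With such a guard the runner would report a skip instead of repeating the identical compilation failure for every drawn example.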
2025-05-07T20:32:53.1819701Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:53.3583063Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError; with compiled=True the traceback only gains a torch/_dynamo/eval_frame.py:678 frame before reaching silu_mul_quant
2025-05-07T20:32:53.3617052Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:53.4561373Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
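For context on what the test exercises: silu_mul_quant fuses silu(x0) * x1 with quantization to FP8, where Triton's fp8e4nv corresponds to torch.float8_e4m3fn, and returns the quantized tensor plus a scale, with scale_ub capping the dynamic range used to derive that scale. A rough eager-mode sketch, assuming row-wise dynamic quantization (the real kernel's scaling granularity and clamping details may differ):

import torch
import torch.nn.functional as F


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1 in float32 for accuracy, then rowwise FP8 quantization.
    y = F.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)  # cap the per-row dynamic range
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    y_scale = amax / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale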
2025-05-07T20:32:53.5287859Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:53.5297738Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:53.5299918Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:53.5302102Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:53.5302444Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free
2025-05-07T20:32:53.5316323Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free
2025-05-07T20:32:53.5337597Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free
2025-05-07T20:32:53.5351408Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB with 28.44 MiB free
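These out-of-memory failures are cumulative rather than intrinsic to any single example: the A10G's 22.07 GiB is already about 21.9 to 22.0 GiB occupied when allocations as small as 40 to 56 MiB fail, because tensors from earlier Hypothesis examples are still referenced and the caching allocator holds their blocks. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting the message itself suggests, releasing cached memory between examples is a common mitigation; a sketch (the setUp placement is illustrative):

import gc

import torch


def release_cuda_memory() -> None:
    # Drop dangling Python references first, then return the caching
    # allocator's blocks to the driver so the next example starts with
    # (nearly) the whole device free.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()


# Illustrative: run before each Hypothesis example.
# def setUp(self) -> None:
#     release_cuda_memory()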
2025-05-07T20:32:53.6490781Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:53.6523649Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:53.7233181Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:53.7265940Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB with 26.44 MiB free
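Another mitigation on a 22 GiB part is to let Hypothesis discard draws that cannot fit: the largest case here, T=16384 with D=7168, needs 448 MiB for x alone (16384 x 14336 bfloat16 values), matching the failed 448.00 MiB allocation above, and the sign/abs/clamp steps materialize several temporaries of the same size. A hedged sketch using hypothesis.assume and torch.cuda.mem_get_info (the copies factor and headroom are guesses, not measured bounds):

import torch
from hypothesis import assume


def fits_in_free_memory(T: int, D: int, copies: int = 4) -> bool:
    # x is [T, 2 * D] bfloat16; the test materializes several same-sized
    # temporaries (abs, clamp, sign, product), hence the copies factor.
    free_bytes, _total = torch.cuda.mem_get_info()
    needed = copies * T * 2 * D * torch.bfloat16.itemsize
    return needed < free_bytes // 2  # keep headroom for the allocator


# Inside test_silu_mul_quant, before allocating x (illustrative):
# assume(fits_in_free_memory(T, D))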
2025-05-07T20:32:53.8086825Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117 (fp8e4nv unsupported)
2025-05-07T20:32:53.8126573Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:53.8135794Z >       x_sign = torch.sign(x)
2025-05-07T20:32:53.8137840Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8139805Z 2025-05-07T20:32:53.8139932Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:53.8140152Z 2025-05-07T20:32:53.8140263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8140699Z self=, 2025-05-07T20:32:53.8141117Z T=16384, 2025-05-07T20:32:53.8141318Z D=5120, 2025-05-07T20:32:53.8141517Z scale_ub=None, 2025-05-07T20:32:53.8141735Z contiguous=True, 2025-05-07T20:32:53.8141964Z compiled=False, 2025-05-07T20:32:53.8142176Z ) 2025-05-07T20:32:53.8884758Z self = 2025-05-07T20:32:53.8885584Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.8885981Z 2025-05-07T20:32:53.8886086Z @given( 2025-05-07T20:32:53.8886412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8886791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8887106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8887567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8887909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8888200Z ) 2025-05-07T20:32:53.8888577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8889044Z def test_silu_mul_quant( 2025-05-07T20:32:53.8889290Z self, 2025-05-07T20:32:53.8889492Z T: int, 2025-05-07T20:32:53.8889699Z D: int, 2025-05-07T20:32:53.8889926Z scale_ub: Optional[float], 2025-05-07T20:32:53.8890276Z contiguous: bool, 2025-05-07T20:32:53.8890535Z compiled: bool, 2025-05-07T20:32:53.8890768Z ) -> None: 2025-05-07T20:32:53.8890992Z torch.manual_seed(2025) 2025-05-07T20:32:53.8891253Z 2025-05-07T20:32:53.8891531Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8893850Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8895839Z 2025-05-07T20:32:53.8895966Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8896191Z 2025-05-07T20:32:53.8896304Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8896741Z self=, 2025-05-07T20:32:53.8897228Z T=4096, 2025-05-07T20:32:53.8897424Z D=5120, 2025-05-07T20:32:53.8897627Z scale_ub=None, 2025-05-07T20:32:53.8897850Z contiguous=True, 2025-05-07T20:32:53.8898112Z compiled=False, 2025-05-07T20:32:53.8898357Z ) 2025-05-07T20:32:53.8898688Z self = 2025-05-07T20:32:53.8899217Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.8899510Z 2025-05-07T20:32:53.8899593Z @given( 2025-05-07T20:32:53.8899835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8900158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8900481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8900837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8901183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8901492Z ) 2025-05-07T20:32:53.8901861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8902328Z def test_silu_mul_quant( 2025-05-07T20:32:53.8902582Z self, 2025-05-07T20:32:53.8902793Z T: int, 2025-05-07T20:32:53.8903004Z D: int, 2025-05-07T20:32:53.8903227Z scale_ub: Optional[float], 2025-05-07T20:32:53.8903516Z contiguous: bool, 2025-05-07T20:32:53.8903773Z compiled: bool, 2025-05-07T20:32:53.8904008Z ) -> None: 2025-05-07T20:32:53.8904234Z torch.manual_seed(2025) 2025-05-07T20:32:53.8904485Z 2025-05-07T20:32:53.8904763Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8907259Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8909289Z 2025-05-07T20:32:53.8909415Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8909642Z 2025-05-07T20:32:53.8909754Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8910231Z self=, 2025-05-07T20:32:53.8910697Z T=2048, 2025-05-07T20:32:53.8910897Z D=5120, 2025-05-07T20:32:53.8911111Z scale_ub=None, 2025-05-07T20:32:53.8911343Z contiguous=False, 2025-05-07T20:32:53.8911647Z compiled=False, 2025-05-07T20:32:53.8911863Z ) 2025-05-07T20:32:53.8912197Z self = 2025-05-07T20:32:53.8912725Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.8913015Z 2025-05-07T20:32:53.8913102Z @given( 2025-05-07T20:32:53.8913336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8913668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8913998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8914351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8914759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8915059Z ) 2025-05-07T20:32:53.8915423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8915879Z def test_silu_mul_quant( 2025-05-07T20:32:53.8916133Z self, 2025-05-07T20:32:53.8916336Z T: int, 2025-05-07T20:32:53.8916546Z D: int, 2025-05-07T20:32:53.8916775Z scale_ub: Optional[float], 2025-05-07T20:32:53.8917052Z contiguous: bool, 2025-05-07T20:32:53.8917299Z compiled: bool, 2025-05-07T20:32:53.8917531Z ) -> None: 2025-05-07T20:32:53.8917814Z torch.manual_seed(2025) 2025-05-07T20:32:53.8918069Z 2025-05-07T20:32:53.8918355Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8920514Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8922480Z 2025-05-07T20:32:53.8922610Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8922833Z 2025-05-07T20:32:53.8922943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8923377Z self=, 2025-05-07T20:32:53.8923801Z T=4096, 2025-05-07T20:32:53.8923988Z D=7168, 2025-05-07T20:32:53.8924185Z scale_ub=None, 2025-05-07T20:32:53.8924409Z contiguous=True, 2025-05-07T20:32:53.8924633Z compiled=True, 2025-05-07T20:32:53.8924848Z ) 2025-05-07T20:32:53.8925184Z self = 2025-05-07T20:32:53.8925697Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.8925986Z 2025-05-07T20:32:53.8926067Z @given( 2025-05-07T20:32:53.8926329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8926661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8926981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8927327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8927671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8927969Z ) 2025-05-07T20:32:53.8928385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8928849Z def test_silu_mul_quant( 2025-05-07T20:32:53.8929100Z self, 2025-05-07T20:32:53.8929348Z T: int, 2025-05-07T20:32:53.8929554Z D: int, 2025-05-07T20:32:53.8929780Z scale_ub: Optional[float], 2025-05-07T20:32:53.8930057Z contiguous: bool, 2025-05-07T20:32:53.8930304Z compiled: bool, 2025-05-07T20:32:53.8930535Z ) -> None: 2025-05-07T20:32:53.8930753Z torch.manual_seed(2025) 2025-05-07T20:32:53.8931005Z 2025-05-07T20:32:53.8931285Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8933576Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8935553Z 2025-05-07T20:32:53.8935682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8935948Z 2025-05-07T20:32:53.8936056Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8936488Z self=, 2025-05-07T20:32:53.8936913Z T=2048, 2025-05-07T20:32:53.8937101Z D=5120, 2025-05-07T20:32:53.8937298Z scale_ub=1200.0, 2025-05-07T20:32:53.8937537Z contiguous=False, 2025-05-07T20:32:53.8937765Z compiled=False, 2025-05-07T20:32:53.8937974Z ) 2025-05-07T20:32:53.8938356Z self = 2025-05-07T20:32:53.8938951Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.8939249Z 2025-05-07T20:32:53.8939329Z @given( 2025-05-07T20:32:53.8939568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8939895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8940213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8940557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8940901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8941194Z ) 2025-05-07T20:32:53.8941557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8942021Z def test_silu_mul_quant( 2025-05-07T20:32:53.8942269Z self, 2025-05-07T20:32:53.8942471Z T: int, 2025-05-07T20:32:53.8942673Z D: int, 2025-05-07T20:32:53.8942896Z scale_ub: Optional[float], 2025-05-07T20:32:53.8943178Z contiguous: bool, 2025-05-07T20:32:53.8943431Z compiled: bool, 2025-05-07T20:32:53.8943656Z ) -> None: 2025-05-07T20:32:53.8943878Z torch.manual_seed(2025) 2025-05-07T20:32:53.8944125Z 2025-05-07T20:32:53.8944413Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8946580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8948546Z 2025-05-07T20:32:53.8948668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8948898Z 2025-05-07T20:32:53.8949004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8949435Z self=, 2025-05-07T20:32:53.8949856Z T=4096, 2025-05-07T20:32:53.8950049Z D=7168, 2025-05-07T20:32:53.8950294Z scale_ub=1200.0, 2025-05-07T20:32:53.8950528Z contiguous=True, 2025-05-07T20:32:53.8950755Z compiled=False, 2025-05-07T20:32:53.8950967Z ) 2025-05-07T20:32:54.0019900Z self = 2025-05-07T20:32:54.0020702Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.0021088Z 2025-05-07T20:32:54.0021195Z @given( 2025-05-07T20:32:54.0021464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0021896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0022216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0022555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0022906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0023202Z ) 2025-05-07T20:32:54.0023560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0024029Z def test_silu_mul_quant( 2025-05-07T20:32:54.0024286Z self, 2025-05-07T20:32:54.0024489Z T: int, 2025-05-07T20:32:54.0024703Z D: int, 2025-05-07T20:32:54.0025004Z scale_ub: Optional[float], 2025-05-07T20:32:54.0025287Z contiguous: bool, 2025-05-07T20:32:54.0025539Z compiled: bool, 2025-05-07T20:32:54.0025776Z ) -> None: 2025-05-07T20:32:54.0025999Z torch.manual_seed(2025) 2025-05-07T20:32:54.0026260Z 2025-05-07T20:32:54.0026558Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0028782Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0030801Z 2025-05-07T20:32:54.0030936Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0031158Z 2025-05-07T20:32:54.0031266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0031701Z self=, 2025-05-07T20:32:54.0032124Z T=16384, 2025-05-07T20:32:54.0032330Z D=7168, 2025-05-07T20:32:54.0032532Z scale_ub=None, 2025-05-07T20:32:54.0032761Z contiguous=False, 2025-05-07T20:32:54.0032991Z compiled=True, 2025-05-07T20:32:54.0033202Z ) 2025-05-07T20:32:54.0033542Z self = 2025-05-07T20:32:54.0034076Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.0034367Z 2025-05-07T20:32:54.0034452Z @given( 2025-05-07T20:32:54.0034699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0035029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0035348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0035694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0036044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0036341Z ) 2025-05-07T20:32:54.0036707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0037182Z def test_silu_mul_quant( 2025-05-07T20:32:54.0037439Z self, 2025-05-07T20:32:54.0037640Z T: int, 2025-05-07T20:32:54.0037851Z D: int, 2025-05-07T20:32:54.0038081Z scale_ub: Optional[float], 2025-05-07T20:32:54.0038393Z contiguous: bool, 2025-05-07T20:32:54.0038673Z compiled: bool, 2025-05-07T20:32:54.0038904Z ) -> None: 2025-05-07T20:32:54.0039127Z torch.manual_seed(2025) 2025-05-07T20:32:54.0039454Z 2025-05-07T20:32:54.0039739Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0041884Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0043880Z 2025-05-07T20:32:54.0044008Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0044239Z 2025-05-07T20:32:54.0044345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0044779Z self=, 2025-05-07T20:32:54.0045214Z T=4096, 2025-05-07T20:32:54.0045408Z D=7168, 2025-05-07T20:32:54.0045610Z scale_ub=None, 2025-05-07T20:32:54.0045837Z contiguous=True, 2025-05-07T20:32:54.0046114Z compiled=False, 2025-05-07T20:32:54.0046342Z ) 2025-05-07T20:32:54.0046680Z self = 2025-05-07T20:32:54.0047195Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.0047489Z 2025-05-07T20:32:54.0047575Z @given( 2025-05-07T20:32:54.0047820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0048145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0048469Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0048863Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0049213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0049509Z ) 2025-05-07T20:32:54.0049880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0050345Z def test_silu_mul_quant( 2025-05-07T20:32:54.0050594Z self, 2025-05-07T20:32:54.0050806Z T: int, 2025-05-07T20:32:54.0051016Z D: int, 2025-05-07T20:32:54.0051242Z scale_ub: Optional[float], 2025-05-07T20:32:54.0051525Z contiguous: bool, 2025-05-07T20:32:54.0051777Z compiled: bool, 2025-05-07T20:32:54.0052065Z ) -> None: 2025-05-07T20:32:54.0052294Z torch.manual_seed(2025) 2025-05-07T20:32:54.0052551Z 2025-05-07T20:32:54.0052829Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0054991Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0056946Z 2025-05-07T20:32:54.0057071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0057301Z 2025-05-07T20:32:54.0057409Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0057848Z self=, 2025-05-07T20:32:54.0058318Z T=16384, 2025-05-07T20:32:54.0058522Z D=7168, 2025-05-07T20:32:54.0058728Z scale_ub=None, 2025-05-07T20:32:54.0058949Z contiguous=True, 2025-05-07T20:32:54.0059190Z compiled=False, 2025-05-07T20:32:54.0059411Z ) 2025-05-07T20:32:54.0059740Z self = 2025-05-07T20:32:54.0060264Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.0060614Z 2025-05-07T20:32:54.0060698Z @given( 2025-05-07T20:32:54.0060942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0061284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0061603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0061948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0062301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0062599Z ) 2025-05-07T20:32:54.0063014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0063479Z def test_silu_mul_quant( 2025-05-07T20:32:54.0063726Z self, 2025-05-07T20:32:54.0063936Z T: int, 2025-05-07T20:32:54.0064156Z D: int, 2025-05-07T20:32:54.0064384Z scale_ub: Optional[float], 2025-05-07T20:32:54.0064681Z contiguous: bool, 2025-05-07T20:32:54.0064943Z compiled: bool, 2025-05-07T20:32:54.0071664Z ) -> None: 2025-05-07T20:32:54.0071902Z torch.manual_seed(2025) 2025-05-07T20:32:54.0072156Z 2025-05-07T20:32:54.0072439Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0074647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0076646Z 2025-05-07T20:32:54.0076778Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0077002Z 2025-05-07T20:32:54.0077108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0077541Z self=, 2025-05-07T20:32:54.0077964Z T=16384, 2025-05-07T20:32:54.0078162Z D=7168, 2025-05-07T20:32:54.0078365Z scale_ub=1200.0, 2025-05-07T20:32:54.0078629Z contiguous=True, 2025-05-07T20:32:54.0078869Z compiled=False, 2025-05-07T20:32:54.0079081Z ) 2025-05-07T20:32:54.0079410Z self = 2025-05-07T20:32:54.0079926Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.0080218Z 2025-05-07T20:32:54.0080300Z @given( 2025-05-07T20:32:54.0080538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0080860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0081179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0081523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0081863Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0082155Z ) 2025-05-07T20:32:54.0082519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0082977Z def test_silu_mul_quant( 2025-05-07T20:32:54.0083223Z self, 2025-05-07T20:32:54.0083430Z T: int, 2025-05-07T20:32:54.0083638Z D: int, 2025-05-07T20:32:54.0083860Z scale_ub: Optional[float], 2025-05-07T20:32:54.0084135Z contiguous: bool, 2025-05-07T20:32:54.0084380Z compiled: bool, 2025-05-07T20:32:54.0084611Z ) -> None: 2025-05-07T20:32:54.0084829Z torch.manual_seed(2025) 2025-05-07T20:32:54.0085082Z 2025-05-07T20:32:54.0085362Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0088185Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
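The OutOfMemoryError sizes track the first tensor the test allocates: x has shape [T, 2 * D] in bfloat16, i.e. T * 2D * 2 bytes. A quick illustrative check (the helper name is invented here, not part of the test):

    # Expected allocation for x = torch.randn([T, 2 * D], dtype=torch.bfloat16), in MiB.
    # bfloat16 occupies 2 bytes per element.
    def randn_alloc_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert randn_alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert randn_alloc_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert randn_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"

The 20.00 MiB requests for the T=128 examples exceed the 3.50 MiB tensor itself, consistent with the caching allocator rounding mid-sized requests up to a larger block size (an inference, not something the log states). Note also that only 26.44 MiB of the 22.07 GiB card was free before the first example here, so the message's PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion is unlikely to help by itself: the device reached this test already nearly full.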
Each CompilationError in the list fails the same way once Triton lowers the kernel; the traceback is identical apart from the entry point (the compiled=True case additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn, return fn(*args, **kwargs)):

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:54.4301269Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True

[test body as above]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |              ^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33e99c60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
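The repeated ValueError is Triton's hardware gate: fp8e4nv is Triton's name for float8_e4m3fn, whose conversions require compute capability 8.9 or newer (Ada/Hopper), while the A10G in a g5.4xlarge runner is SM 8.6 and only offers fp8e5 and fp8e4b15. A hedged sketch of a capability guard such a test could use; the helper name and threshold are illustrative, not FBGEMM's actual gating:

```python
# Illustrative guard: skip FP8 E4M3 (Triton fp8e4nv) work on GPUs below SM 8.9.
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # A10G (g5.*) reports (8, 6) -> False; L4/L40S (8, 9) and H100 (9, 0) -> True.
    return (major, minor) >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv supported:", supports_fp8e4nv())
```

On this runner the guard would report False, which matches the CompilationError above.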
Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:54.4675579Z 2025-05-07T20:32:54.4675676Z moe/activation_test.py:126: 2025-05-07T20:32:54.4675810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4675925Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.4676062Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4676645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.4676796Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.4677171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4677408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4677792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.4678055Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.4678452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.4678625Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.4678987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.4679068Z fn() 2025-05-07T20:32:54.4679483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.4679575Z self.fn.run( 2025-05-07T20:32:54.4679928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4680020Z kernel = self.compile( 2025-05-07T20:32:54.4680423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4680601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4680739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4680747Z 2025-05-07T20:32:54.4680956Z self = 2025-05-07T20:32:54.4681768Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4682369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd932e0>} 2025-05-07T20:32:54.4683150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4683354Z context = 2025-05-07T20:32:54.4683359Z 2025-05-07T20:32:54.4683526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4683839Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4683951Z module_map=module_map) 2025-05-07T20:32:54.4684118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4684225Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.4684302Z E ^ 2025-05-07T20:32:54.4684671Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4684676Z 2025-05-07T20:32:54.4685157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
[... Trying example T=2048, T=128, and T=4096 (all with D=5120, scale_ub=None, contiguous=True, compiled=True) fail with the same CompilationError in _kernel_quantize_fp8_row; the duplicated source listings and tracebacks are omitted, each ending with: ...]
E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4767848Z 2025-05-07T20:32:54.4768280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4768816Z 2025-05-07T20:32:54.4768927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4769350Z self=, 2025-05-07T20:32:54.4769766Z T=16384, 2025-05-07T20:32:54.4769962Z D=5120, 2025-05-07T20:32:54.4770158Z scale_ub=None, 2025-05-07T20:32:54.4770413Z contiguous=True, 2025-05-07T20:32:54.4770642Z compiled=True, 2025-05-07T20:32:54.4770848Z ) 2025-05-07T20:32:54.4771169Z self = 2025-05-07T20:32:54.4771677Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.4772052Z 2025-05-07T20:32:54.4772138Z @given( 2025-05-07T20:32:54.4772368Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4772696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4773006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4773419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4773793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4774111Z ) 2025-05-07T20:32:54.4774515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4775025Z def test_silu_mul_quant( 2025-05-07T20:32:54.4775285Z self, 2025-05-07T20:32:54.4775492Z T: int, 2025-05-07T20:32:54.4775708Z D: int, 2025-05-07T20:32:54.4775940Z scale_ub: Optional[float], 2025-05-07T20:32:54.4776232Z contiguous: bool, 2025-05-07T20:32:54.4776490Z compiled: bool, 2025-05-07T20:32:54.4776729Z ) -> None: 2025-05-07T20:32:54.4776951Z torch.manual_seed(2025) 2025-05-07T20:32:54.4777217Z 2025-05-07T20:32:54.4777513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4777893Z 2025-05-07T20:32:54.4778100Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4778421Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4778781Z x = x_sign * x_clamp 2025-05-07T20:32:54.4779072Z x0 = x[:, :D] 2025-05-07T20:32:54.4779306Z x1 = x[:, D:] 2025-05-07T20:32:54.4779526Z 2025-05-07T20:32:54.4779724Z if contiguous: 2025-05-07T20:32:54.4779973Z x0 = x0.contiguous() 2025-05-07T20:32:54.4780255Z x1 = x1.contiguous() 2025-05-07T20:32:54.4780512Z 2025-05-07T20:32:54.4780713Z if scale_ub is not None: 2025-05-07T20:32:54.4781008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4781372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4781716Z ) 2025-05-07T20:32:54.4781920Z else: 2025-05-07T20:32:54.4782137Z scale_ub_tensor = None 2025-05-07T20:32:54.4782412Z 2025-05-07T20:32:54.4782657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4783002Z op = silu_mul_quant 2025-05-07T20:32:54.4783275Z if compiled: 2025-05-07T20:32:54.4783541Z op = torch.compile(op) 2025-05-07T20:32:54.4783860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4784161Z 2025-05-07T20:32:54.4784418Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.4784726Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.4785052Z 2025-05-07T20:32:54.4785307Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4785682Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.4786001Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.4786350Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.4786807Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4787117Z 2025-05-07T20:32:54.4787321Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.4787520Z 2025-05-07T20:32:54.4787626Z moe/activation_test.py:126: 2025-05-07T20:32:54.4787926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4788269Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.4788602Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4789516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.4790289Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.4790853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4791562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4792281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.4793024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.4793824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.4794489Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.4795103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.4795639Z fn() 2025-05-07T20:32:54.4796174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.4796773Z self.fn.run( 2025-05-07T20:32:54.4797250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4797802Z kernel = self.compile( 2025-05-07T20:32:54.4798357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4799077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4799489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4799732Z 2025-05-07T20:32:54.4799947Z self = 2025-05-07T20:32:54.4801078Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4802509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ccb9620>} 2025-05-07T20:32:54.4803907Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4804971Z context = 2025-05-07T20:32:54.4805268Z 2025-05-07T20:32:54.4805442Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4806028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4806793Z module_map=module_map) 2025-05-07T20:32:54.4807177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4807540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.4807804Z E ^ 2025-05-07T20:32:54.4808285Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4808918Z 2025-05-07T20:32:54.4809355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4809889Z 2025-05-07T20:32:54.4810005Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4810429Z self=, 2025-05-07T20:32:54.4810847Z T=1, 2025-05-07T20:32:54.4811035Z D=5120, 2025-05-07T20:32:54.4811223Z scale_ub=1200.0, 2025-05-07T20:32:54.4811451Z contiguous=True, 2025-05-07T20:32:54.4811675Z compiled=True, 2025-05-07T20:32:54.4811957Z ) 2025-05-07T20:32:54.4812361Z self = 2025-05-07T20:32:54.4812866Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.4813135Z 2025-05-07T20:32:54.4813216Z @given( 2025-05-07T20:32:54.4813443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4813762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4814077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4814409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4814747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4815119Z ) 2025-05-07T20:32:54.4815472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4815924Z def test_silu_mul_quant( 2025-05-07T20:32:54.4816172Z self, 2025-05-07T20:32:54.4816362Z T: int, 2025-05-07T20:32:54.4816561Z D: int, 2025-05-07T20:32:54.4816782Z scale_ub: Optional[float], 2025-05-07T20:32:54.4817055Z contiguous: bool, 2025-05-07T20:32:54.4817301Z compiled: bool, 2025-05-07T20:32:54.4817526Z ) -> None: 2025-05-07T20:32:54.4817745Z torch.manual_seed(2025) 2025-05-07T20:32:54.4817987Z 2025-05-07T20:32:54.4818267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4818619Z 2025-05-07T20:32:54.4818806Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4819101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4819418Z x = x_sign * x_clamp 2025-05-07T20:32:54.4819661Z x0 = x[:, :D] 2025-05-07T20:32:54.4819878Z x1 = x[:, D:] 2025-05-07T20:32:54.4820089Z 2025-05-07T20:32:54.4820272Z if contiguous: 2025-05-07T20:32:54.4820508Z x0 = x0.contiguous() 2025-05-07T20:32:54.4820771Z x1 = x1.contiguous() 2025-05-07T20:32:54.4821010Z 2025-05-07T20:32:54.4821202Z if scale_ub is not None: 2025-05-07T20:32:54.4821481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4821815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4822127Z ) 2025-05-07T20:32:54.4822323Z else: 2025-05-07T20:32:54.4822534Z scale_ub_tensor = None 2025-05-07T20:32:54.4822781Z 2025-05-07T20:32:54.4823015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4823338Z op = silu_mul_quant 2025-05-07T20:32:54.4823585Z if compiled: 2025-05-07T20:32:54.4823834Z op = torch.compile(op) 2025-05-07T20:32:54.4824139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4824410Z 2025-05-07T20:32:54.4824604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4824771Z 2025-05-07T20:32:54.4824876Z moe/activation_test.py:117: 2025-05-07T20:32:54.4825244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4825585Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4825874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4826448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.4827019Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.4827694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4828445Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4828990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4829694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4830376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4830921Z kernel = self.compile( 2025-05-07T20:32:54.4831511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4832197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4832605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4832840Z 2025-05-07T20:32:54.4833062Z self = 2025-05-07T20:32:54.4834177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4835683Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c844720>} 2025-05-07T20:32:54.4837077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4838143Z context = 2025-05-07T20:32:54.4838440Z 2025-05-07T20:32:54.4838611Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4839207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4839690Z module_map=module_map) 2025-05-07T20:32:54.4840067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4840429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4840694Z E ^ 2025-05-07T20:32:54.4841179Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4841644Z 2025-05-07T20:32:54.4842075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
[... Trying example T=1 (D=5120, scale_ub=None, contiguous=False, compiled=True) fails with the same CompilationError in _kernel_quantize_fp8_row via ref_fn; the duplicated source listing and traceback are omitted, ending with: ...]
E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4887634Z 2025-05-07T20:32:54.4888065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4888602Z 2025-05-07T20:32:54.4888705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4889124Z self=, 2025-05-07T20:32:54.4889534Z T=1, 2025-05-07T20:32:54.4889710Z D=5120, 2025-05-07T20:32:54.4889907Z scale_ub=None, 2025-05-07T20:32:54.4890122Z contiguous=True, 2025-05-07T20:32:54.4890340Z compiled=False, 2025-05-07T20:32:54.4890545Z ) 2025-05-07T20:32:54.4890881Z self = 2025-05-07T20:32:54.4891372Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.4891646Z 2025-05-07T20:32:54.4891723Z @given( 2025-05-07T20:32:54.4892086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4892400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4892711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4893096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4893440Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4893727Z ) 2025-05-07T20:32:54.4894086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4894539Z def test_silu_mul_quant( 2025-05-07T20:32:54.4894779Z self, 2025-05-07T20:32:54.4894972Z T: int, 2025-05-07T20:32:54.4895168Z D: int, 2025-05-07T20:32:54.4895383Z scale_ub: Optional[float], 2025-05-07T20:32:54.4895705Z contiguous: bool, 2025-05-07T20:32:54.4895947Z compiled: bool, 2025-05-07T20:32:54.4896166Z ) -> None: 2025-05-07T20:32:54.4896383Z torch.manual_seed(2025) 2025-05-07T20:32:54.4896636Z 2025-05-07T20:32:54.4896905Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4897254Z 2025-05-07T20:32:54.4897446Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4897736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4898054Z x = x_sign * x_clamp 2025-05-07T20:32:54.4898295Z x0 = x[:, :D] 2025-05-07T20:32:54.4898634Z x1 = x[:, D:] 2025-05-07T20:32:54.4898845Z 2025-05-07T20:32:54.4899031Z if contiguous: 2025-05-07T20:32:54.4899263Z x0 = x0.contiguous() 2025-05-07T20:32:54.4899515Z x1 = x1.contiguous() 2025-05-07T20:32:54.4899766Z 2025-05-07T20:32:54.4899958Z if scale_ub is not None: 2025-05-07T20:32:54.4900230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4900572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4900887Z ) 2025-05-07T20:32:54.4901075Z else: 2025-05-07T20:32:54.4901340Z scale_ub_tensor = None 2025-05-07T20:32:54.4901595Z 2025-05-07T20:32:54.4901823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4902146Z op = silu_mul_quant 2025-05-07T20:32:54.4902409Z if compiled: 2025-05-07T20:32:54.4902659Z op = torch.compile(op) 2025-05-07T20:32:54.4902960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4903241Z 2025-05-07T20:32:54.4903429Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4903602Z 2025-05-07T20:32:54.4903701Z moe/activation_test.py:117: 2025-05-07T20:32:54.4904007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4904346Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4904628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4905341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4906053Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4906942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4907654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4908339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4908935Z kernel = self.compile( 2025-05-07T20:32:54.4909484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4910156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4910564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4910799Z 2025-05-07T20:32:54.4911017Z self = 2025-05-07T20:32:54.4912129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4913715Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c8476a0>} 2025-05-07T20:32:54.4915112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4916171Z context = 2025-05-07T20:32:54.4916540Z 2025-05-07T20:32:54.4916709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4917246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4917725Z module_map=module_map) 2025-05-07T20:32:54.4918094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4918445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4918707Z E ^ 2025-05-07T20:32:54.4919333Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4919801Z 2025-05-07T20:32:54.4920235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4920765Z 2025-05-07T20:32:54.4920869Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4921293Z self=, 2025-05-07T20:32:54.4921708Z T=128, 2025-05-07T20:32:54.4921892Z D=5120, 2025-05-07T20:32:54.4922088Z scale_ub=None, 2025-05-07T20:32:54.4922304Z contiguous=False, 2025-05-07T20:32:54.4922599Z compiled=True, 2025-05-07T20:32:54.4922804Z ) 2025-05-07T20:32:54.4923133Z self = 2025-05-07T20:32:54.4923635Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.4923918Z 2025-05-07T20:32:54.4923994Z @given( 2025-05-07T20:32:54.4924226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4924549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4924854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4925189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4925526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4925809Z ) 2025-05-07T20:32:54.4926174Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4926625Z def test_silu_mul_quant( 2025-05-07T20:32:54.4926865Z self, 2025-05-07T20:32:54.4927063Z T: int, 2025-05-07T20:32:54.4927265Z D: int, 2025-05-07T20:32:54.4927479Z scale_ub: Optional[float], 2025-05-07T20:32:54.4927754Z contiguous: bool, 2025-05-07T20:32:54.4927998Z compiled: bool, 2025-05-07T20:32:54.4928225Z ) -> None: 2025-05-07T20:32:54.4928437Z torch.manual_seed(2025) 2025-05-07T20:32:54.4928681Z 2025-05-07T20:32:54.4928956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4929296Z 2025-05-07T20:32:54.4929488Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4929781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4930088Z x = x_sign * x_clamp 2025-05-07T20:32:54.4930327Z x0 = x[:, :D] 2025-05-07T20:32:54.4930546Z x1 = x[:, D:] 2025-05-07T20:32:54.4930747Z 2025-05-07T20:32:54.4930932Z if contiguous: 2025-05-07T20:32:54.4931163Z x0 = x0.contiguous() 2025-05-07T20:32:54.4931415Z x1 = x1.contiguous() 2025-05-07T20:32:54.4931658Z 2025-05-07T20:32:54.4931925Z if scale_ub is not None: 2025-05-07T20:32:54.4932209Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4932545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4932911Z ) 2025-05-07T20:32:54.4933107Z else: 2025-05-07T20:32:54.4933322Z scale_ub_tensor = None 2025-05-07T20:32:54.4933568Z 2025-05-07T20:32:54.4933805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4934124Z op = silu_mul_quant 2025-05-07T20:32:54.4934370Z if compiled: 2025-05-07T20:32:54.4934619Z op = torch.compile(op) 2025-05-07T20:32:54.4934915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4935233Z 2025-05-07T20:32:54.4935428Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4935592Z 2025-05-07T20:32:54.4935699Z moe/activation_test.py:117: 2025-05-07T20:32:54.4935992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4936335Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4936618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4937192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.4937762Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[... make_ir locals identical to the first traceback above ...]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
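Every traceback in this run bottoms out in the same ValueError: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA hardware only accepts from compute capability 8.9 (Ada/Hopper) onward; earlier parts expose only fp8e4b15 and fp8e5, exactly as the message lists. Below is a minimal sketch, assuming the job ran on a pre-8.9 device such as an A10G (capability 8.6), of a guard that would skip these cases instead of failing them. The helper name, skip message, and the (8, 9) threshold are illustrative assumptions, not code from the FBGEMM test suite:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton compiles fp8e4nv (torch.float8_e4m3fn)
        # kernels only on NVIDIA devices with compute capability >= (8, 9);
        # a device reporting (8, 6) raises the ValueError seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant, this would skip rather than fail:
    skip_if_no_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv needs compute capability >= 8.9; only fp8e4b15/fp8e5 here",
    )

With such a decorator in place, the Hypothesis retries below would never run: unittest marks the whole test as skipped before the first example is drawn.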
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[... the following retries repeat the test source and the silu_mul_quant
traceback verbatim; each fails with the same CompilationError,
ValueError("type fp8e4nv not supported in this architecture. ...") ...]

Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

[... the next example fails one step later: fn() returns, and the error is
raised from the reference path instead ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [... @given/@settings decorators and test setup as above ...]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[... make_ir locals as above, but num_stages=2 ...]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
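This example is the outlier in the run: fn() returned, and the same fp8e4nv error instead surfaced while the autotuner benchmarked _kernel_quantize_fp8_row for the reference path. For orientation, here is a rough pure-PyTorch sketch of the rowwise quantization contract the test relies on (it dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]); the FP8_MAX constant, the epsilon guard, and the function name are assumptions for illustration, not FBGEMM's actual triton_quantize_fp8_row kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Each row is scaled by its own absolute maximum so the largest
        # entry maps to the FP8 representable limit.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Mirror the optional scale_ub_tensor cap passed by the test.
            row_max = torch.minimum(row_max, scale_ub)
        # y_scale is the per-row dequantization multiplier: y ~ y_fp8 * y_scale.
        y_scale = row_max.clamp_min(1e-12) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Whether the downstream comparison checks dequantized values or raw FP8 payloads is not visible in this excerpt; the sketch only fixes the scale convention that the visible code depends on.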
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5033590Z 2025-05-07T20:32:54.5034015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5034022Z 2025-05-07T20:32:54.5034131Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5034402Z self=, 2025-05-07T20:32:54.5034480Z T=1, 2025-05-07T20:32:54.5034561Z D=5120, 2025-05-07T20:32:54.5034645Z scale_ub=1200.0, 2025-05-07T20:32:54.5034733Z contiguous=False, 2025-05-07T20:32:54.5034825Z compiled=True, 2025-05-07T20:32:54.5034897Z ) 2025-05-07T20:32:54.5035128Z self = 2025-05-07T20:32:54.5035297Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5035302Z 2025-05-07T20:32:54.5035420Z @given( 2025-05-07T20:32:54.5035547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5035645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5035759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5035885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5035998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5036071Z ) 2025-05-07T20:32:54.5036330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5036423Z def test_silu_mul_quant( 2025-05-07T20:32:54.5036505Z self, 2025-05-07T20:32:54.5036622Z T: int, 2025-05-07T20:32:54.5036700Z D: int, 2025-05-07T20:32:54.5036802Z scale_ub: Optional[float], 2025-05-07T20:32:54.5036892Z contiguous: bool, 2025-05-07T20:32:54.5036977Z compiled: bool, 2025-05-07T20:32:54.5037058Z ) -> None: 2025-05-07T20:32:54.5037152Z torch.manual_seed(2025) 2025-05-07T20:32:54.5037225Z 2025-05-07T20:32:54.5037402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5037475Z 2025-05-07T20:32:54.5037566Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5037740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5037828Z x = x_sign * x_clamp 2025-05-07T20:32:54.5037915Z x0 = x[:, :D] 2025-05-07T20:32:54.5037994Z x1 = x[:, D:] 2025-05-07T20:32:54.5038068Z 2025-05-07T20:32:54.5038158Z if contiguous: 2025-05-07T20:32:54.5038249Z x0 = x0.contiguous() 2025-05-07T20:32:54.5038340Z x1 = x1.contiguous() 2025-05-07T20:32:54.5038418Z 2025-05-07T20:32:54.5038509Z if scale_ub is not None: 2025-05-07T20:32:54.5038614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5038756Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5038831Z ) 2025-05-07T20:32:54.5038917Z else: 2025-05-07T20:32:54.5039034Z scale_ub_tensor = None 2025-05-07T20:32:54.5039115Z 2025-05-07T20:32:54.5039258Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5039369Z op = silu_mul_quant 2025-05-07T20:32:54.5039459Z if compiled: 2025-05-07T20:32:54.5039566Z op = torch.compile(op) 2025-05-07T20:32:54.5039674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5039745Z 2025-05-07T20:32:54.5039845Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5039849Z 2025-05-07T20:32:54.5039949Z moe/activation_test.py:117: 2025-05-07T20:32:54.5040085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5040194Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5040295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5040670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5040774Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5041280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5041390Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5041756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5042034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5042393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5042497Z kernel = self.compile( 2025-05-07T20:32:54.5042900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5043079Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5043210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5043256Z 2025-05-07T20:32:54.5043473Z self = 2025-05-07T20:32:54.5044277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5044807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92de40>} 2025-05-07T20:32:54.5045619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5045815Z context = 2025-05-07T20:32:54.5045822Z 2025-05-07T20:32:54.5045996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5046268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5046424Z module_map=module_map) 2025-05-07T20:32:54.5046587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5046686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5046774Z E ^ 2025-05-07T20:32:54.5047139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5047146Z 2025-05-07T20:32:54.5047576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5047580Z 2025-05-07T20:32:54.5047684Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5047914Z self=, 2025-05-07T20:32:54.5047999Z T=1, 2025-05-07T20:32:54.5048076Z D=5120, 2025-05-07T20:32:54.5048160Z scale_ub=1200.0, 2025-05-07T20:32:54.5048252Z contiguous=False, 2025-05-07T20:32:54.5048336Z compiled=False, 2025-05-07T20:32:54.5048410Z ) 2025-05-07T20:32:54.5048639Z self = 2025-05-07T20:32:54.5048810Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5048816Z 2025-05-07T20:32:54.5048899Z @given( 2025-05-07T20:32:54.5049020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5049124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5049244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5049362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5049480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5049559Z ) 2025-05-07T20:32:54.5049813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5049905Z def test_silu_mul_quant( 2025-05-07T20:32:54.5049989Z self, 2025-05-07T20:32:54.5050066Z T: int, 2025-05-07T20:32:54.5050149Z D: int, 2025-05-07T20:32:54.5050247Z scale_ub: Optional[float], 2025-05-07T20:32:54.5050337Z contiguous: bool, 2025-05-07T20:32:54.5050430Z compiled: bool, 2025-05-07T20:32:54.5050507Z ) -> None: 2025-05-07T20:32:54.5050652Z torch.manual_seed(2025) 2025-05-07T20:32:54.5050735Z 2025-05-07T20:32:54.5050908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5050986Z 2025-05-07T20:32:54.5051084Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5051209Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5051297Z x = x_sign * x_clamp 2025-05-07T20:32:54.5051384Z x0 = x[:, :D] 2025-05-07T20:32:54.5051463Z x1 = x[:, D:] 2025-05-07T20:32:54.5051578Z 2025-05-07T20:32:54.5051670Z if contiguous: 2025-05-07T20:32:54.5051761Z x0 = x0.contiguous() 2025-05-07T20:32:54.5051957Z x1 = x1.contiguous() 2025-05-07T20:32:54.5052033Z 2025-05-07T20:32:54.5052126Z if scale_ub is not None: 2025-05-07T20:32:54.5052238Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5052373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5052452Z ) 2025-05-07T20:32:54.5052536Z else: 2025-05-07T20:32:54.5052630Z scale_ub_tensor = None 2025-05-07T20:32:54.5052702Z 2025-05-07T20:32:54.5052887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5052979Z op = silu_mul_quant 2025-05-07T20:32:54.5053066Z if compiled: 2025-05-07T20:32:54.5053172Z op = torch.compile(op) 2025-05-07T20:32:54.5053281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5053361Z 2025-05-07T20:32:54.5053452Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5053457Z 2025-05-07T20:32:54.5053555Z moe/activation_test.py:117: 2025-05-07T20:32:54.5053693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5053842Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5053942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5054470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5054568Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5054947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5055174Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5055526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5055630Z kernel = self.compile( 2025-05-07T20:32:54.5056024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5056203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5056342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5056346Z 2025-05-07T20:32:54.5056559Z self = 2025-05-07T20:32:54.5057376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5057897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92eac0>} 2025-05-07T20:32:54.5058725Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5058924Z context = 2025-05-07T20:32:54.5058928Z 2025-05-07T20:32:54.5059098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5059421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5059531Z module_map=module_map) 2025-05-07T20:32:54.5059703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5059802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5059879Z E ^ 2025-05-07T20:32:54.5060250Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5060399Z 2025-05-07T20:32:54.5060826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5060831Z 2025-05-07T20:32:54.5060934Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5061175Z self=, 2025-05-07T20:32:54.5061253Z T=16384, 2025-05-07T20:32:54.5061335Z D=5120, 2025-05-07T20:32:54.5061421Z scale_ub=1200.0, 2025-05-07T20:32:54.5061505Z contiguous=False, 2025-05-07T20:32:54.5061594Z compiled=True, 2025-05-07T20:32:54.5061666Z ) 2025-05-07T20:32:54.5061967Z self = 2025-05-07T20:32:54.5062160Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5062164Z 2025-05-07T20:32:54.5062242Z @given( 2025-05-07T20:32:54.5062362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5062471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5062585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5062710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5062821Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5062936Z ) 2025-05-07T20:32:54.5063194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5063286Z def test_silu_mul_quant( 2025-05-07T20:32:54.5063366Z self, 2025-05-07T20:32:54.5063465Z T: int, 2025-05-07T20:32:54.5063543Z D: int, 2025-05-07T20:32:54.5063643Z scale_ub: Optional[float], 2025-05-07T20:32:54.5063739Z contiguous: bool, 2025-05-07T20:32:54.5063826Z compiled: bool, 2025-05-07T20:32:54.5063905Z ) -> None: 2025-05-07T20:32:54.5064007Z torch.manual_seed(2025) 2025-05-07T20:32:54.5064080Z 2025-05-07T20:32:54.5064253Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5064336Z 2025-05-07T20:32:54.5064428Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5064553Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5064648Z x = x_sign * x_clamp 2025-05-07T20:32:54.5064732Z x0 = x[:, :D] 2025-05-07T20:32:54.5064813Z x1 = x[:, D:] 2025-05-07T20:32:54.5064891Z 2025-05-07T20:32:54.5064973Z if contiguous: 2025-05-07T20:32:54.5065074Z x0 = x0.contiguous() 2025-05-07T20:32:54.5065165Z x1 = x1.contiguous() 2025-05-07T20:32:54.5065238Z 2025-05-07T20:32:54.5065337Z if scale_ub is not None: 2025-05-07T20:32:54.5065450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5065588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5065672Z ) 2025-05-07T20:32:54.5065748Z else: 2025-05-07T20:32:54.5065842Z scale_ub_tensor = None 2025-05-07T20:32:54.5065923Z 2025-05-07T20:32:54.5066060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5066150Z op = silu_mul_quant 2025-05-07T20:32:54.5066245Z if compiled: 2025-05-07T20:32:54.5066344Z op = torch.compile(op) 2025-05-07T20:32:54.5066462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5066534Z 2025-05-07T20:32:54.5066626Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5066631Z 2025-05-07T20:32:54.5066786Z moe/activation_test.py:117: 2025-05-07T20:32:54.5066919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5067020Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5067132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5067509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5067601Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c668180>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = 
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c668ea0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
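Every failure above has the same root cause: the Triton kernel behind silu_mul_quant materializes its output as fp8e4nv (PyTorch's float8_e4m3fn), and Triton only lowers that encoding on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older architectures only the fp8e4b15 and fp8e5 encodings exist, which is exactly what the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware (the helper name supports_fp8e4nv and the decorator placement are illustrative assumptions, not the actual FBGEMM test code):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+;
        # earlier GPUs expose just fp8e4b15 and fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class SiluMulQuantTests(unittest.TestCase):
        ...  # test_silu_mul_quant as echoed above

With such a guard the runner would report the parametrized cases as skipped instead of re-raising the identical CompilationError for every example Hypothesis draws.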
The identical test body and CompilationError traceback repeated for each further example Hypothesis tried:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
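For reference, the unsupported conversion is reproducible outside the test suite with a few lines of Triton (a hypothetical standalone repro, not taken from this log):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        # Storing a bf16 value as fp8e4nv trips the same ValueError at
        # compile time on GPUs below SM 8.9.
        offs = tl.arange(0, 16)
        x = tl.load(x_ptr + offs)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))

    x = torch.randn(16, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # raises triton.compiler.errors.CompilationError on SM < 8.9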
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5229232Z 2025-05-07T20:32:54.5229669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5229674Z 2025-05-07T20:32:54.5229780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5230022Z self=, 2025-05-07T20:32:54.5230100Z T=2048, 2025-05-07T20:32:54.5230181Z D=7168, 2025-05-07T20:32:54.5230272Z scale_ub=None, 2025-05-07T20:32:54.5230360Z contiguous=False, 2025-05-07T20:32:54.5230446Z compiled=True, 2025-05-07T20:32:54.5230525Z ) 2025-05-07T20:32:54.5230752Z self = 2025-05-07T20:32:54.5230931Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5230938Z 2025-05-07T20:32:54.5231025Z @given( 2025-05-07T20:32:54.5231146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5231252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5231369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5231487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5231610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5231683Z ) 2025-05-07T20:32:54.5231934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5232035Z def test_silu_mul_quant( 2025-05-07T20:32:54.5232111Z self, 2025-05-07T20:32:54.5232187Z T: int, 2025-05-07T20:32:54.5232270Z D: int, 2025-05-07T20:32:54.5232366Z scale_ub: Optional[float], 2025-05-07T20:32:54.5232455Z contiguous: bool, 2025-05-07T20:32:54.5232545Z compiled: bool, 2025-05-07T20:32:54.5232625Z ) -> None: 2025-05-07T20:32:54.5232724Z torch.manual_seed(2025) 2025-05-07T20:32:54.5232797Z 2025-05-07T20:32:54.5232967Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5233047Z 2025-05-07T20:32:54.5233139Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5233264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5233357Z x = x_sign * x_clamp 2025-05-07T20:32:54.5233483Z x0 = x[:, :D] 2025-05-07T20:32:54.5233569Z x1 = x[:, D:] 2025-05-07T20:32:54.5233646Z 2025-05-07T20:32:54.5233730Z if contiguous: 2025-05-07T20:32:54.5233824Z x0 = x0.contiguous() 2025-05-07T20:32:54.5233920Z x1 = x1.contiguous() 2025-05-07T20:32:54.5233992Z 2025-05-07T20:32:54.5234083Z if scale_ub is not None: 2025-05-07T20:32:54.5234194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5234330Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5234456Z ) 2025-05-07T20:32:54.5234532Z else: 2025-05-07T20:32:54.5234627Z scale_ub_tensor = None 2025-05-07T20:32:54.5234708Z 2025-05-07T20:32:54.5234841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5234933Z op = silu_mul_quant 2025-05-07T20:32:54.5235024Z if compiled: 2025-05-07T20:32:54.5235122Z op = torch.compile(op) 2025-05-07T20:32:54.5235230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5235310Z 2025-05-07T20:32:54.5235399Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5235403Z 2025-05-07T20:32:54.5235552Z moe/activation_test.py:117: 2025-05-07T20:32:54.5235684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5235785Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5235891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5236267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5236364Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5236877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5237014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5237392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5237621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5237972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5238077Z kernel = self.compile( 2025-05-07T20:32:54.5238470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5238649Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5238790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5238795Z 2025-05-07T20:32:54.5239033Z self = 2025-05-07T20:32:54.5239866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5240389Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d0720>} 2025-05-07T20:32:54.5241170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5241367Z context = 2025-05-07T20:32:54.5241372Z 2025-05-07T20:32:54.5241540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5241822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5241929Z module_map=module_map) 2025-05-07T20:32:54.5242138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5242247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5242324Z E ^ 2025-05-07T20:32:54.5242698Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5242703Z 2025-05-07T20:32:54.5243129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5243134Z 2025-05-07T20:32:54.5243237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5243515Z self=, 2025-05-07T20:32:54.5243591Z T=4096, 2025-05-07T20:32:54.5243677Z D=7168, 2025-05-07T20:32:54.5243765Z scale_ub=None, 2025-05-07T20:32:54.5243854Z contiguous=False, 2025-05-07T20:32:54.5243945Z compiled=True, 2025-05-07T20:32:54.5244019Z ) 2025-05-07T20:32:54.5244244Z self = 2025-05-07T20:32:54.5244431Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5244436Z 2025-05-07T20:32:54.5244514Z @given( 2025-05-07T20:32:54.5244679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5244787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5244903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5245027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5245140Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5245221Z ) 2025-05-07T20:32:54.5245480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5245574Z def test_silu_mul_quant( 2025-05-07T20:32:54.5245693Z self, 2025-05-07T20:32:54.5245776Z T: int, 2025-05-07T20:32:54.5245853Z D: int, 2025-05-07T20:32:54.5245952Z scale_ub: Optional[float], 2025-05-07T20:32:54.5246048Z contiguous: bool, 2025-05-07T20:32:54.5246138Z compiled: bool, 2025-05-07T20:32:54.5246216Z ) -> None: 2025-05-07T20:32:54.5246316Z torch.manual_seed(2025) 2025-05-07T20:32:54.5246391Z 2025-05-07T20:32:54.5246570Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5246646Z 2025-05-07T20:32:54.5246741Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5246870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5246958Z x = x_sign * x_clamp 2025-05-07T20:32:54.5247040Z x0 = x[:, :D] 2025-05-07T20:32:54.5247125Z x1 = x[:, D:] 2025-05-07T20:32:54.5247198Z 2025-05-07T20:32:54.5247280Z if contiguous: 2025-05-07T20:32:54.5247382Z x0 = x0.contiguous() 2025-05-07T20:32:54.5247474Z x1 = x1.contiguous() 2025-05-07T20:32:54.5247546Z 2025-05-07T20:32:54.5247642Z if scale_ub is not None: 2025-05-07T20:32:54.5247747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5247885Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5247965Z ) 2025-05-07T20:32:54.5248041Z else: 2025-05-07T20:32:54.5248145Z scale_ub_tensor = None 2025-05-07T20:32:54.5248215Z 2025-05-07T20:32:54.5248344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5248439Z op = silu_mul_quant 2025-05-07T20:32:54.5248525Z if compiled: 2025-05-07T20:32:54.5248624Z op = torch.compile(op) 2025-05-07T20:32:54.5248738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5248814Z 2025-05-07T20:32:54.5248912Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5248918Z 2025-05-07T20:32:54.5249038Z moe/activation_test.py:117: 2025-05-07T20:32:54.5249195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5249301Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5249402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5249828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5249931Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5250438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5250535Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5250907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5251173Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5251527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5251626Z kernel = self.compile( 2025-05-07T20:32:54.5252114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5252298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5252491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5252496Z 2025-05-07T20:32:54.5252715Z self = 2025-05-07T20:32:54.5253519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5254038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d1440>} 2025-05-07T20:32:54.5254856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5255054Z context = 2025-05-07T20:32:54.5255058Z 2025-05-07T20:32:54.5255236Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5255506Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5255614Z module_map=module_map) 2025-05-07T20:32:54.5255788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5255890Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5255966Z E ^ 2025-05-07T20:32:54.5256336Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5256344Z 2025-05-07T20:32:54.5256770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5256774Z 2025-05-07T20:32:54.5256885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5257131Z self=, 2025-05-07T20:32:54.5257211Z T=16384, 2025-05-07T20:32:54.5257296Z D=5120, 2025-05-07T20:32:54.5257380Z scale_ub=1200.0, 2025-05-07T20:32:54.5257474Z contiguous=False, 2025-05-07T20:32:54.5263081Z compiled=False, 2025-05-07T20:32:54.5263170Z ) 2025-05-07T20:32:54.5263421Z self = 2025-05-07T20:32:54.5263616Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5263621Z 2025-05-07T20:32:54.5263707Z @given( 2025-05-07T20:32:54.5263829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5263933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5264055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5264286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5264402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5264484Z ) 2025-05-07T20:32:54.5264743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5264838Z def test_silu_mul_quant( 2025-05-07T20:32:54.5264921Z self, 2025-05-07T20:32:54.5264999Z T: int, 2025-05-07T20:32:54.5265079Z D: int, 2025-05-07T20:32:54.5265183Z scale_ub: Optional[float], 2025-05-07T20:32:54.5265318Z contiguous: bool, 2025-05-07T20:32:54.5265410Z compiled: bool, 2025-05-07T20:32:54.5265489Z ) -> None: 2025-05-07T20:32:54.5265584Z torch.manual_seed(2025) 2025-05-07T20:32:54.5265666Z 2025-05-07T20:32:54.5265844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5265918Z 2025-05-07T20:32:54.5266017Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5266144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5266237Z x = x_sign * x_clamp 2025-05-07T20:32:54.5266323Z x0 = x[:, :D] 2025-05-07T20:32:54.5266403Z x1 = x[:, D:] 2025-05-07T20:32:54.5266519Z 2025-05-07T20:32:54.5266610Z if contiguous: 2025-05-07T20:32:54.5266702Z x0 = x0.contiguous() 2025-05-07T20:32:54.5266799Z x1 = x1.contiguous() 2025-05-07T20:32:54.5266871Z 2025-05-07T20:32:54.5266962Z if scale_ub is not None: 2025-05-07T20:32:54.5267075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5267215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5267293Z ) 2025-05-07T20:32:54.5267378Z else: 2025-05-07T20:32:54.5267471Z scale_ub_tensor = None 2025-05-07T20:32:54.5267586Z 2025-05-07T20:32:54.5267727Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5267818Z op = silu_mul_quant 2025-05-07T20:32:54.5267902Z if compiled: 2025-05-07T20:32:54.5268019Z op = torch.compile(op) 2025-05-07T20:32:54.5268124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5268205Z 2025-05-07T20:32:54.5268299Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5268304Z 2025-05-07T20:32:54.5268402Z moe/activation_test.py:117: 2025-05-07T20:32:54.5268543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5268651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5268762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5269328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.5269431Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5269809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5270039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5270392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5270497Z kernel = self.compile( 2025-05-07T20:32:54.5270890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5271071Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5271209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5271215Z 2025-05-07T20:32:54.5271425Z self = 2025-05-07T20:32:54.5272239Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5272809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d2340>} 2025-05-07T20:32:54.5273591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5273788Z context = 2025-05-07T20:32:54.5273831Z 2025-05-07T20:32:54.5274000Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5274277Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5274388Z module_map=module_map) 2025-05-07T20:32:54.5274553Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5274656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5274732Z E ^ 2025-05-07T20:32:54.5275107Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5275112Z 2025-05-07T20:32:54.5275577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5275582Z 2025-05-07T20:32:54.5275687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5275922Z self=, 2025-05-07T20:32:54.5276001Z T=16384, 2025-05-07T20:32:54.5276082Z D=5120, 2025-05-07T20:32:54.5276166Z scale_ub=1200.0, 2025-05-07T20:32:54.5276252Z contiguous=True, 2025-05-07T20:32:54.5276340Z compiled=True, 2025-05-07T20:32:54.5276457Z ) 2025-05-07T20:32:54.5276682Z self = 2025-05-07T20:32:54.5276869Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5276877Z 2025-05-07T20:32:54.5276954Z @given( 2025-05-07T20:32:54.5277075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5277182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5277297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5277421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5277535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5277609Z ) 2025-05-07T20:32:54.5277867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5277962Z def test_silu_mul_quant( 2025-05-07T20:32:54.5278039Z self, 2025-05-07T20:32:54.5278122Z T: int, 2025-05-07T20:32:54.5278202Z D: int, 2025-05-07T20:32:54.5278300Z scale_ub: Optional[float], 2025-05-07T20:32:54.5278396Z contiguous: bool, 2025-05-07T20:32:54.5278481Z compiled: bool, 2025-05-07T20:32:54.5278559Z ) -> None: 2025-05-07T20:32:54.5278685Z torch.manual_seed(2025) 2025-05-07T20:32:54.5278762Z 2025-05-07T20:32:54.5278965Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5279042Z 2025-05-07T20:32:54.5279135Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5279266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5279354Z x = x_sign * x_clamp 2025-05-07T20:32:54.5279433Z x0 = x[:, :D] 2025-05-07T20:32:54.5279520Z x1 = x[:, D:] 2025-05-07T20:32:54.5279593Z 2025-05-07T20:32:54.5279678Z if contiguous: 2025-05-07T20:32:54.5279775Z x0 = x0.contiguous() 2025-05-07T20:32:54.5279864Z x1 = x1.contiguous() 2025-05-07T20:32:54.5279937Z 2025-05-07T20:32:54.5280037Z if scale_ub is not None: 2025-05-07T20:32:54.5280142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5280280Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5280361Z ) 2025-05-07T20:32:54.5280489Z else: 2025-05-07T20:32:54.5280591Z scale_ub_tensor = None 2025-05-07T20:32:54.5280666Z 2025-05-07T20:32:54.5280797Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5280892Z op = silu_mul_quant 2025-05-07T20:32:54.5280978Z if compiled: 2025-05-07T20:32:54.5281078Z op = torch.compile(op) 2025-05-07T20:32:54.5281191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5281265Z 2025-05-07T20:32:54.5281398Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5281403Z 2025-05-07T20:32:54.5281507Z moe/activation_test.py:117: 2025-05-07T20:32:54.5281644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5281753Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5281853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5282234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5282333Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5282882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5282981Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5283355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5283582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5283943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5284037Z kernel = self.compile( 2025-05-07T20:32:54.5284471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5284662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5284796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5284801Z 2025-05-07T20:32:54.5285019Z self = 2025-05-07T20:32:54.5285821Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5286347Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d39c0>} 2025-05-07T20:32:54.5287129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5287329Z context = 2025-05-07T20:32:54.5287334Z 2025-05-07T20:32:54.5287509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5287783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5287893Z module_map=module_map) 2025-05-07T20:32:54.5288063Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5288164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5288243Z E ^ 2025-05-07T20:32:54.5288656Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5288662Z 2025-05-07T20:32:54.5289103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5289110Z 2025-05-07T20:32:54.5289219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5289496Z self=, 2025-05-07T20:32:54.5289577Z T=16384, 2025-05-07T20:32:54.5289661Z D=5120, 2025-05-07T20:32:54.5289745Z scale_ub=None, 2025-05-07T20:32:54.5289841Z contiguous=False, 2025-05-07T20:32:54.5289927Z compiled=True, 2025-05-07T20:32:54.5289999Z ) 2025-05-07T20:32:54.5290232Z self = 2025-05-07T20:32:54.5290414Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5290460Z 2025-05-07T20:32:54.5290536Z @given( 2025-05-07T20:32:54.5290664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5290764Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5290884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5291008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5291121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5291201Z ) 2025-05-07T20:32:54.5291456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5291547Z def test_silu_mul_quant( 2025-05-07T20:32:54.5291670Z self, 2025-05-07T20:32:54.5291749Z T: int, 2025-05-07T20:32:54.5291950Z D: int, 2025-05-07T20:32:54.5292054Z scale_ub: Optional[float], 2025-05-07T20:32:54.5292143Z contiguous: bool, 2025-05-07T20:32:54.5292227Z compiled: bool, 2025-05-07T20:32:54.5292316Z ) -> None: 2025-05-07T20:32:54.5292412Z torch.manual_seed(2025) 2025-05-07T20:32:54.5292485Z 2025-05-07T20:32:54.5292660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5292732Z 2025-05-07T20:32:54.5292916Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5293043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5293131Z x = x_sign * x_clamp 2025-05-07T20:32:54.5293216Z x0 = x[:, :D] 2025-05-07T20:32:54.5293299Z x1 = x[:, D:] 2025-05-07T20:32:54.5293371Z 2025-05-07T20:32:54.5293462Z if contiguous: 2025-05-07T20:32:54.5293555Z x0 = x0.contiguous() 2025-05-07T20:32:54.5293647Z x1 = x1.contiguous() 2025-05-07T20:32:54.5293727Z 2025-05-07T20:32:54.5293816Z if scale_ub is not None: 2025-05-07T20:32:54.5293920Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5294064Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5294142Z ) 2025-05-07T20:32:54.5294218Z else: 2025-05-07T20:32:54.5294318Z scale_ub_tensor = None 2025-05-07T20:32:54.5294390Z 2025-05-07T20:32:54.5294524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5294618Z op = silu_mul_quant 2025-05-07T20:32:54.5294705Z if compiled: 2025-05-07T20:32:54.5294810Z op = torch.compile(op) 2025-05-07T20:32:54.5294914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5294987Z 2025-05-07T20:32:54.5295087Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5295091Z 2025-05-07T20:32:54.5295187Z moe/activation_test.py:117: 2025-05-07T20:32:54.5295323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5295430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5295531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5295916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5296010Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5296521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5296626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5296994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5297274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5297639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5297735Z kernel = self.compile( 2025-05-07T20:32:54.5298139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5298318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5298518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5298523Z 2025-05-07T20:32:54.5298740Z self = 2025-05-07T20:32:54.5299553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5300120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce40c20>} 2025-05-07T20:32:54.5300893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5301096Z context = 2025-05-07T20:32:54.5301103Z 2025-05-07T20:32:54.5301273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5301544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5301700Z module_map=module_map) 2025-05-07T20:32:54.5301865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5301969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5302056Z E ^ 2025-05-07T20:32:54.5302423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5302427Z 2025-05-07T20:32:54.5302860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5302864Z 2025-05-07T20:32:54.5302967Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5303201Z self=, 2025-05-07T20:32:54.5303285Z T=2048, 2025-05-07T20:32:54.5303361Z D=5120, 2025-05-07T20:32:54.5303448Z scale_ub=None, 2025-05-07T20:32:54.5303542Z contiguous=False, 2025-05-07T20:32:54.5303629Z compiled=True, 2025-05-07T20:32:54.5303702Z ) 2025-05-07T20:32:54.5303932Z self = 2025-05-07T20:32:54.5304114Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5304118Z 2025-05-07T20:32:54.5304204Z @given( 2025-05-07T20:32:54.5304327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5304427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5304552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5304670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5304784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5304868Z ) 2025-05-07T20:32:54.5305123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5305219Z def test_silu_mul_quant( 2025-05-07T20:32:54.5305295Z self, 2025-05-07T20:32:54.5305374Z T: int, 2025-05-07T20:32:54.5305456Z D: int, 2025-05-07T20:32:54.5305554Z scale_ub: Optional[float], 2025-05-07T20:32:54.5305642Z contiguous: bool, 2025-05-07T20:32:54.5305732Z compiled: bool, 2025-05-07T20:32:54.5305856Z ) -> None: 2025-05-07T20:32:54.5305952Z torch.manual_seed(2025) 2025-05-07T20:32:54.5306032Z 2025-05-07T20:32:54.5306642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5306751Z 2025-05-07T20:32:54.5306887Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5307019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5307115Z x = x_sign * x_clamp 2025-05-07T20:32:54.5307197Z x0 = x[:, :D] 2025-05-07T20:32:54.5307468Z x1 = x[:, D:] 2025-05-07T20:32:54.5307548Z 2025-05-07T20:32:54.5307631Z if contiguous: 2025-05-07T20:32:54.5307722Z x0 = x0.contiguous() 2025-05-07T20:32:54.5307816Z x1 = x1.contiguous() 2025-05-07T20:32:54.5307889Z 2025-05-07T20:32:54.5307978Z if scale_ub is not None: 2025-05-07T20:32:54.5308088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5308228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5308305Z ) 2025-05-07T20:32:54.5308407Z else: 2025-05-07T20:32:54.5308509Z scale_ub_tensor = None 2025-05-07T20:32:54.5308671Z 2025-05-07T20:32:54.5308810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5308899Z op = silu_mul_quant 2025-05-07T20:32:54.5308991Z if compiled: 2025-05-07T20:32:54.5309091Z op = torch.compile(op) 2025-05-07T20:32:54.5309196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5309277Z 2025-05-07T20:32:54.5309367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5309372Z 2025-05-07T20:32:54.5309469Z moe/activation_test.py:117: 2025-05-07T20:32:54.5309605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5309776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5309877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5310262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5310355Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5310872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5310970Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5311337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5311574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5311923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5312027Z kernel = self.compile( 2025-05-07T20:32:54.5312421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5312601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5312739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5312743Z 2025-05-07T20:32:54.5312960Z self = 2025-05-07T20:32:54.5313766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5314297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce419e0>} 2025-05-07T20:32:54.5315074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5315349Z context = 2025-05-07T20:32:54.5315354Z 2025-05-07T20:32:54.5315530Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5315808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5315916Z module_map=module_map) 2025-05-07T20:32:54.5316082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5316230Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5316308Z E ^ 2025-05-07T20:32:54.5316676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5316692Z 2025-05-07T20:32:54.5317119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5317124Z 2025-05-07T20:32:54.5317230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5317469Z self=, 2025-05-07T20:32:54.5317547Z T=2048, 2025-05-07T20:32:54.5317663Z D=5120, 2025-05-07T20:32:54.5317755Z scale_ub=1200.0, 2025-05-07T20:32:54.5317842Z contiguous=False, 2025-05-07T20:32:54.5317926Z compiled=True, 2025-05-07T20:32:54.5318006Z ) 2025-05-07T20:32:54.5318232Z self = 2025-05-07T20:32:54.5318422Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5318426Z 2025-05-07T20:32:54.5318504Z @given( 2025-05-07T20:32:54.5318623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5318792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5318907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5319041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5319181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5319272Z ) 2025-05-07T20:32:54.5319527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5319634Z def test_silu_mul_quant( 2025-05-07T20:32:54.5319710Z self, 2025-05-07T20:32:54.5319794Z T: int, 2025-05-07T20:32:54.5319870Z D: int, 2025-05-07T20:32:54.5319972Z scale_ub: Optional[float], 2025-05-07T20:32:54.5320068Z contiguous: bool, 2025-05-07T20:32:54.5320156Z compiled: bool, 2025-05-07T20:32:54.5320233Z ) -> None: 2025-05-07T20:32:54.5320335Z torch.manual_seed(2025) 2025-05-07T20:32:54.5320410Z 2025-05-07T20:32:54.5320580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5320666Z 2025-05-07T20:32:54.5320757Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5320885Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5320976Z x = x_sign * x_clamp 2025-05-07T20:32:54.5321061Z x0 = x[:, :D] 2025-05-07T20:32:54.5321146Z x1 = x[:, D:] 2025-05-07T20:32:54.5321218Z 2025-05-07T20:32:54.5321300Z if contiguous: 2025-05-07T20:32:54.5321400Z x0 = x0.contiguous() 2025-05-07T20:32:54.5321489Z x1 = x1.contiguous() 2025-05-07T20:32:54.5321561Z 2025-05-07T20:32:54.5321656Z if scale_ub is not None: 2025-05-07T20:32:54.5321760Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5321896Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5321982Z ) 2025-05-07T20:32:54.5322059Z else: 2025-05-07T20:32:54.5322153Z scale_ub_tensor = None 2025-05-07T20:32:54.5322230Z 2025-05-07T20:32:54.5322364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5322473Z op = silu_mul_quant 2025-05-07T20:32:54.5322567Z if compiled: 2025-05-07T20:32:54.5322668Z op = torch.compile(op) 2025-05-07T20:32:54.5322824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5322905Z 2025-05-07T20:32:54.5322997Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5323004Z 2025-05-07T20:32:54.5323102Z moe/activation_test.py:117: 2025-05-07T20:32:54.5323239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5323342Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5323452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5323871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5323964Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5324480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5324581Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5324952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5325187Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5325608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5325711Z kernel = self.compile( 2025-05-07T20:32:54.5326107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5326288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5326425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5326429Z 2025-05-07T20:32:54.5326680Z self = 2025-05-07T20:32:54.5327493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5328016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce42b60>} 2025-05-07T20:32:54.5328791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5328994Z context = 2025-05-07T20:32:54.5328999Z 2025-05-07T20:32:54.5329168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5329450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5329560Z module_map=module_map) 2025-05-07T20:32:54.5329727Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5329833Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5329912Z E ^ 2025-05-07T20:32:54.5330286Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5330290Z 2025-05-07T20:32:54.5330719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5330724Z 2025-05-07T20:32:54.5330831Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5331068Z self=, 2025-05-07T20:32:54.5331144Z T=4096, 2025-05-07T20:32:54.5331219Z D=5120, 2025-05-07T20:32:54.5331310Z scale_ub=1200.0, 2025-05-07T20:32:54.5331396Z contiguous=True, 2025-05-07T20:32:54.5331485Z compiled=True, 2025-05-07T20:32:54.5331557Z ) 2025-05-07T20:32:54.5331906Z self = 2025-05-07T20:32:54.5332096Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5332100Z 2025-05-07T20:32:54.5332177Z @given( 2025-05-07T20:32:54.5332300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5332410Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5332525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5332644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5333393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5333466Z ) 2025-05-07T20:32:54.5333727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5333820Z def test_silu_mul_quant( 2025-05-07T20:32:54.5333898Z self, 2025-05-07T20:32:54.5333980Z T: int, 2025-05-07T20:32:54.5334055Z D: int, 2025-05-07T20:32:54.5334152Z scale_ub: Optional[float], 2025-05-07T20:32:54.5334249Z contiguous: bool, 2025-05-07T20:32:54.5334336Z compiled: bool, 2025-05-07T20:32:54.5334414Z ) -> None: 2025-05-07T20:32:54.5334514Z torch.manual_seed(2025) 2025-05-07T20:32:54.5334630Z 2025-05-07T20:32:54.5334803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5334888Z 2025-05-07T20:32:54.5334979Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5335110Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5335199Z x = x_sign * x_clamp 2025-05-07T20:32:54.5335283Z x0 = x[:, :D] 2025-05-07T20:32:54.5335368Z x1 = x[:, D:] 2025-05-07T20:32:54.5335441Z 2025-05-07T20:32:54.5335523Z if contiguous: 2025-05-07T20:32:54.5335623Z x0 = x0.contiguous() 2025-05-07T20:32:54.5335844Z x1 = x1.contiguous() 2025-05-07T20:32:54.5335916Z 2025-05-07T20:32:54.5336014Z if scale_ub is not None: 2025-05-07T20:32:54.5336120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5336261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5336348Z ) 2025-05-07T20:32:54.5336424Z else: 2025-05-07T20:32:54.5336520Z scale_ub_tensor = None 2025-05-07T20:32:54.5336600Z 2025-05-07T20:32:54.5336730Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5336828Z op = silu_mul_quant 2025-05-07T20:32:54.5336913Z if compiled: 2025-05-07T20:32:54.5337013Z op = torch.compile(op) 2025-05-07T20:32:54.5337127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5337200Z 2025-05-07T20:32:54.5337294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5337298Z 2025-05-07T20:32:54.5337404Z moe/activation_test.py:117: 2025-05-07T20:32:54.5337536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5337638Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5337746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5338124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5338227Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5338737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5338835Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5339212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5339444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5339799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5339896Z kernel = self.compile( 2025-05-07T20:32:54.5340336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5340526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5340660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5340664Z 2025-05-07T20:32:54.5340873Z self = 2025-05-07T20:32:54.5341687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5342253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871f58220>} 2025-05-07T20:32:54.5343047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5343243Z context = 2025-05-07T20:32:54.5343248Z 2025-05-07T20:32:54.5343463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5343735Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5343845Z module_map=module_map) 2025-05-07T20:32:54.5344019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5344117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5344194Z E ^ 2025-05-07T20:32:54.5344565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5344610Z 2025-05-07T20:32:54.5345040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5345049Z 2025-05-07T20:32:54.5345160Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5345393Z self=, 2025-05-07T20:32:54.5345470Z T=128, 2025-05-07T20:32:54.5345554Z D=5120, 2025-05-07T20:32:54.5345638Z scale_ub=1200.0, 2025-05-07T20:32:54.5345724Z contiguous=False, 2025-05-07T20:32:54.5345817Z compiled=True, 2025-05-07T20:32:54.5345889Z ) 2025-05-07T20:32:54.5346119Z self = 2025-05-07T20:32:54.5346299Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5346303Z 2025-05-07T20:32:54.5346381Z @given( 2025-05-07T20:32:54.5346509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5346610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5346724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5346851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5346964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5347039Z ) 2025-05-07T20:32:54.5347299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5347392Z def test_silu_mul_quant( 2025-05-07T20:32:54.5347474Z self, 2025-05-07T20:32:54.5347549Z T: int, 2025-05-07T20:32:54.5347626Z D: int, 2025-05-07T20:32:54.5347729Z scale_ub: Optional[float], 2025-05-07T20:32:54.5347819Z contiguous: bool, 2025-05-07T20:32:54.5347904Z compiled: bool, 2025-05-07T20:32:54.5347990Z ) -> None: 2025-05-07T20:32:54.5348084Z torch.manual_seed(2025) 2025-05-07T20:32:54.5348155Z 2025-05-07T20:32:54.5348335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5348408Z 2025-05-07T20:32:54.5348499Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5348676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5348764Z x = x_sign * x_clamp 2025-05-07T20:32:54.5348848Z x0 = x[:, :D] 2025-05-07T20:32:54.5348927Z x1 = x[:, D:] 2025-05-07T20:32:54.5349002Z 2025-05-07T20:32:54.5349092Z if contiguous: 2025-05-07T20:32:54.5349182Z x0 = x0.contiguous() 2025-05-07T20:32:54.5349271Z x1 = x1.contiguous() 2025-05-07T20:32:54.5349350Z 2025-05-07T20:32:54.5349442Z if scale_ub is not None: 2025-05-07T20:32:54.5349549Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5349734Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5349811Z ) 2025-05-07T20:32:54.5349890Z else: 2025-05-07T20:32:54.5349991Z scale_ub_tensor = None 2025-05-07T20:32:54.5350068Z 2025-05-07T20:32:54.5350200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5350296Z op = silu_mul_quant 2025-05-07T20:32:54.5350381Z if compiled: 2025-05-07T20:32:54.5350486Z op = torch.compile(op) 2025-05-07T20:32:54.5350591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5350664Z 2025-05-07T20:32:54.5350802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5350807Z 2025-05-07T20:32:54.5350906Z moe/activation_test.py:117: 2025-05-07T20:32:54.5351038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5351143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5351247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5351630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5351722Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5352271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5352374Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5352743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5352975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5353331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5353425Z kernel = self.compile( 2025-05-07T20:32:54.5353824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5354005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5354134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5354142Z 2025-05-07T20:32:54.5354356Z self = 2025-05-07T20:32:54.5355165Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5355695Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871f58ea0>} 2025-05-07T20:32:54.5356468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5356664Z context = 2025-05-07T20:32:54.5356675Z 2025-05-07T20:32:54.5356851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5357123Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5357281Z module_map=module_map) 2025-05-07T20:32:54.5357448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5357548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5357636Z E ^ 2025-05-07T20:32:54.5358002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5358007Z 2025-05-07T20:32:54.5358444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5358519Z 2025-05-07T20:32:54.5358624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5358852Z self=, 2025-05-07T20:32:54.5358940Z T=16384, 2025-05-07T20:32:54.5359018Z D=7168, 2025-05-07T20:32:54.5359112Z scale_ub=1200.0, 2025-05-07T20:32:54.5359219Z contiguous=True, 2025-05-07T20:32:54.5359318Z compiled=True, 2025-05-07T20:32:54.5359403Z ) 2025-05-07T20:32:54.5359637Z self = 2025-05-07T20:32:54.5359818Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5359867Z 2025-05-07T20:32:54.5359954Z @given( 2025-05-07T20:32:54.5360075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5360175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5360296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5360417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5360530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5360611Z ) 2025-05-07T20:32:54.5360863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5361002Z def test_silu_mul_quant( 2025-05-07T20:32:54.5361085Z self, 2025-05-07T20:32:54.5361163Z T: int, 2025-05-07T20:32:54.5361247Z D: int, 2025-05-07T20:32:54.5361349Z scale_ub: Optional[float], 2025-05-07T20:32:54.5361439Z contiguous: bool, 2025-05-07T20:32:54.5361531Z compiled: bool, 2025-05-07T20:32:54.5361609Z ) -> None: 2025-05-07T20:32:54.5361706Z torch.manual_seed(2025) 2025-05-07T20:32:54.5361783Z 2025-05-07T20:32:54.5361951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5362024Z 2025-05-07T20:32:54.5362121Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5362245Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5362336Z x = x_sign * x_clamp 2025-05-07T20:32:54.5362422Z x0 = x[:, :D] 2025-05-07T20:32:54.5362500Z x1 = x[:, D:] 2025-05-07T20:32:54.5362577Z 2025-05-07T20:32:54.5362663Z if contiguous: 2025-05-07T20:32:54.5362754Z x0 = x0.contiguous() 2025-05-07T20:32:54.5362851Z x1 = x1.contiguous() 2025-05-07T20:32:54.5362924Z 2025-05-07T20:32:54.5363015Z if scale_ub is not None: 2025-05-07T20:32:54.5363131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5363267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5363346Z ) 2025-05-07T20:32:54.5363429Z else: 2025-05-07T20:32:54.5363522Z scale_ub_tensor = None 2025-05-07T20:32:54.5363594Z 2025-05-07T20:32:54.5363733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5363822Z op = silu_mul_quant 2025-05-07T20:32:54.5363908Z if compiled: 2025-05-07T20:32:54.5364013Z op = torch.compile(op) 2025-05-07T20:32:54.5364118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5364196Z 2025-05-07T20:32:54.5364287Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5364294Z 2025-05-07T20:32:54.5364388Z moe/activation_test.py:117: 2025-05-07T20:32:54.5364527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5364677Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5364778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5365164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5365259Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5365775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.5365871Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.5366279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.5366512Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.5366867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.5366961Z     kernel = self.compile(
2025-05-07T20:32:54.5367366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.5367545Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5367722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5367937Z self = <triton.compiler.compiler.ASTSource object>
2025-05-07T20:32:54.5368740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.5369318Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd871f5a0c0>}
2025-05-07T20:32:54.5370138Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:54.5370342Z context = <...>
2025-05-07T20:32:54.5370515Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.5370791Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.5370897Z                            module_map=module_map)
2025-05-07T20:32:54.5371064Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5371170Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.5371247Z E       ^
2025-05-07T20:32:54.5371616Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.5372166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:54.5372274Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.5372515Z     self=<...>,
2025-05-07T20:32:54.5372593Z     T=16384,
2025-05-07T20:32:54.5372670Z     D=5120,
2025-05-07T20:32:54.5372760Z     scale_ub=1200.0,
2025-05-07T20:32:54.5372845Z     contiguous=True,
2025-05-07T20:32:54.5372933Z     compiled=False,
2025-05-07T20:32:54.5373018Z )
2025-05-07T20:32:54.5373424Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:54.5373515Z     @given(
2025-05-07T20:32:54.5373636Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.5373742Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.5373908Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.5374030Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.5374153Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.5374230Z     )
2025-05-07T20:32:54.5374484Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.5374587Z     def test_silu_mul_quant(
2025-05-07T20:32:54.5374664Z         self,
2025-05-07T20:32:54.5374741Z         T: int,
2025-05-07T20:32:54.5374826Z         D: int,
2025-05-07T20:32:54.5374967Z         scale_ub: Optional[float],
2025-05-07T20:32:54.5375064Z         contiguous: bool,
2025-05-07T20:32:54.5375149Z         compiled: bool,
2025-05-07T20:32:54.5375227Z     ) -> None:
2025-05-07T20:32:54.5375332Z         torch.manual_seed(2025)
2025-05-07T20:32:54.5375578Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.5375752Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.5375880Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.5375976Z         x = x_sign * x_clamp
2025-05-07T20:32:54.5376099Z         x0 = x[:, :D]
2025-05-07T20:32:54.5376181Z         x1 = x[:, D:]
2025-05-07T20:32:54.5376343Z         if contiguous:
2025-05-07T20:32:54.5376433Z             x0 = x0.contiguous()
2025-05-07T20:32:54.5376532Z             x1 = x1.contiguous()
2025-05-07T20:32:54.5376698Z         if scale_ub is not None:
2025-05-07T20:32:54.5376803Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.5376939Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.5377021Z             )
2025-05-07T20:32:54.5377140Z         else:
2025-05-07T20:32:54.5377235Z             scale_ub_tensor = None
2025-05-07T20:32:54.5377447Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.5377540Z             op = silu_mul_quant
2025-05-07T20:32:54.5377631Z             if compiled:
2025-05-07T20:32:54.5377730Z                 op = torch.compile(op)
2025-05-07T20:32:54.5377837Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.5378008Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.5378124Z moe/activation_test.py:117:
2025-05-07T20:32:54.5378275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5378401Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.5378508Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.5379022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.5379121Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.5379497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.5379726Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.5380084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.5380179Z     kernel = self.compile(
2025-05-07T20:32:54.5380572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.5380756Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5384229Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5384329Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.5384413Z E       ^
2025-05-07T20:32:54.5384818Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.5385250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
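Every example that reaches the kernel launch dies at the same point: Triton refuses to emit fp8e4nv (its name for float8_e4m3fn) on this runner's GPU, offering only ('fp8e4b15', 'fp8e5'). The g5.4xlarge carries an A10G, which reports compute capability (8, 6). A minimal guard along these lines, assuming a capability floor of (8, 9) and a hypothetical helper name (this is not what the suite currently does), would skip these examples instead of erroring:

```python
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (float8_e4m3fn) kernels need compute capability
    # >= (8, 9) (Ada/Hopper). The A10G on this runner reports (8, 6),
    # which matches the CompilationError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test:
# @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
# def test_silu_mul_quant(self, ...) -> None: ...
```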
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError (reached through torch/_dynamo/eval_frame.py:678, since compiled=True routes the call through torch.compile)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError
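For orientation while reading these failures: judging from the test body alone, silu_mul_quant fuses y = silu(x0) * x1 with a rowwise FP8 quantization and returns the quantized tensor together with per-row scales, which is exactly why it needs fp8e4nv. The eager sketch below is an assumption inferred from the test, not FBGEMM's implementation; the 448.0 e4m3 max, the scale layout, and the name silu_mul_quant_ref are all guesses:

```python
from typing import Optional, Tuple

import torch
import torch.nn.functional as F

FP8_E4M3_MAX = 448.0  # assumed representable max for torch.float8_e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Presumed semantics: y = silu(x0) * x1, quantized rowwise to fp8e4nv
    # (torch.float8_e4m3fn), returning the fp8 tensor and per-row scales.
    y = F.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        # Cap the per-row max, mirroring the test's scale_ub_tensor argument.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)
```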
2025-05-07T20:32:54.5445433Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.5445751Z     T=16384,
2025-05-07T20:32:54.5445829Z     D=5120,
2025-05-07T20:32:54.5445922Z     scale_ub=None,
2025-05-07T20:32:54.5446008Z     contiguous=False,
2025-05-07T20:32:54.5446094Z     compiled=False,
2025-05-07T20:32:54.5446175Z )
2025-05-07T20:32:54.5446589Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:54.5448732Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.5448951Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.5449100Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.5451038Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.5451215Z moe/activation_test.py:95: OutOfMemoryError
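From here on, most examples never reach the kernel: at T=16384 a single [T, 2*D] bf16 input already exceeds 0.6 GiB, and tensors cached from earlier examples are still resident, so the 22 GiB A10G runs dry during test setup. The allocator's own hint is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; pairing that with an explicit cache flush between Hypothesis examples is a plausible mitigation. A sketch, where the tearDown placement is an assumption rather than something this suite currently does:

```python
import os

# Allocator hint taken from the error message; it must be set before the
# first CUDA allocation to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch


def release_cuda_memory() -> None:
    # Drop unreferenced tensors, then return cached blocks to the driver,
    # so each generated example starts from a clean allocator state.
    gc.collect()
    torch.cuda.empty_cache()


# Hypothetical placement in the test class:
# def tearDown(self) -> None:
#     release_cuda_memory()
```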
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5451085Z 2025-05-07T20:32:54.5451215Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5451220Z 2025-05-07T20:32:54.5451323Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5451556Z self=, 2025-05-07T20:32:54.5451640Z T=4096, 2025-05-07T20:32:54.5451717Z D=7168, 2025-05-07T20:32:54.5451800Z scale_ub=1200.0, 2025-05-07T20:32:54.5451984Z contiguous=True, 2025-05-07T20:32:54.5452067Z compiled=True, 2025-05-07T20:32:54.5452144Z ) 2025-05-07T20:32:54.5452380Z self = 2025-05-07T20:32:54.5452558Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5452563Z 2025-05-07T20:32:54.5452648Z @given( 2025-05-07T20:32:54.5452768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5452866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5452988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5453105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5453218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5453301Z ) 2025-05-07T20:32:54.5453554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5453648Z def test_silu_mul_quant( 2025-05-07T20:32:54.5453732Z self, 2025-05-07T20:32:54.5453808Z T: int, 2025-05-07T20:32:54.5453885Z D: int, 2025-05-07T20:32:54.5453990Z scale_ub: Optional[float], 2025-05-07T20:32:54.5454077Z contiguous: bool, 2025-05-07T20:32:54.5454166Z compiled: bool, 2025-05-07T20:32:54.5454263Z ) -> None: 2025-05-07T20:32:54.5454357Z torch.manual_seed(2025) 2025-05-07T20:32:54.5454435Z 2025-05-07T20:32:54.5454613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5454687Z 2025-05-07T20:32:54.5454780Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5454961Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5456846Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5456900Z 2025-05-07T20:32:54.5457024Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5457031Z 2025-05-07T20:32:54.5457134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5457369Z self=, 2025-05-07T20:32:54.5457449Z T=16384, 2025-05-07T20:32:54.5457524Z D=7168, 2025-05-07T20:32:54.5457614Z scale_ub=None, 2025-05-07T20:32:54.5457700Z contiguous=False, 2025-05-07T20:32:54.5457856Z compiled=False, 2025-05-07T20:32:54.5457938Z ) 2025-05-07T20:32:54.5458162Z self = 2025-05-07T20:32:54.5458352Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5458356Z 2025-05-07T20:32:54.5458433Z @given( 2025-05-07T20:32:54.5458568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5458685Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5458825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5458989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5459109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5459183Z ) 2025-05-07T20:32:54.5459442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5459545Z def test_silu_mul_quant( 2025-05-07T20:32:54.5459624Z self, 2025-05-07T20:32:54.5459710Z T: int, 2025-05-07T20:32:54.5459792Z D: int, 2025-05-07T20:32:54.5459891Z scale_ub: Optional[float], 2025-05-07T20:32:54.5459985Z contiguous: bool, 2025-05-07T20:32:54.5460071Z compiled: bool, 2025-05-07T20:32:54.5460150Z ) -> None: 2025-05-07T20:32:54.5460251Z torch.manual_seed(2025) 2025-05-07T20:32:54.5460328Z 2025-05-07T20:32:54.5460495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5462382Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5462391Z 2025-05-07T20:32:54.5462509Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5462513Z 2025-05-07T20:32:54.5462621Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5462852Z self=, 2025-05-07T20:32:54.5462939Z T=2048, 2025-05-07T20:32:54.5463015Z D=7168, 2025-05-07T20:32:54.5463098Z scale_ub=1200.0, 2025-05-07T20:32:54.5463189Z contiguous=True, 2025-05-07T20:32:54.5463271Z compiled=True, 2025-05-07T20:32:54.5463346Z ) 2025-05-07T20:32:54.5463577Z self = 2025-05-07T20:32:54.5463752Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5463757Z 2025-05-07T20:32:54.5463882Z @given( 2025-05-07T20:32:54.5464015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5464115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5464235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5464352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5464466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5464546Z ) 2025-05-07T20:32:54.5464799Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5464936Z def test_silu_mul_quant( 2025-05-07T20:32:54.5465017Z self, 2025-05-07T20:32:54.5465096Z T: int, 2025-05-07T20:32:54.5465172Z D: int, 2025-05-07T20:32:54.5465281Z scale_ub: Optional[float], 2025-05-07T20:32:54.5465373Z contiguous: bool, 2025-05-07T20:32:54.5465458Z compiled: bool, 2025-05-07T20:32:54.5465542Z ) -> None: 2025-05-07T20:32:54.5465641Z torch.manual_seed(2025) 2025-05-07T20:32:54.5465720Z 2025-05-07T20:32:54.5465889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5466010Z 2025-05-07T20:32:54.5466108Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5466236Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5468090Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5468142Z 2025-05-07T20:32:54.5468279Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5468285Z 2025-05-07T20:32:54.5468397Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5468654Z self=, 2025-05-07T20:32:54.5468734Z T=2048, 2025-05-07T20:32:54.5468810Z D=7168, 2025-05-07T20:32:54.5468901Z scale_ub=None, 2025-05-07T20:32:54.5468988Z contiguous=True, 2025-05-07T20:32:54.5469080Z compiled=False, 2025-05-07T20:32:54.5469156Z ) 2025-05-07T20:32:54.5469384Z self = 2025-05-07T20:32:54.5469566Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5469570Z 2025-05-07T20:32:54.5469649Z @given( 2025-05-07T20:32:54.5469772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5469881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5469995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5470115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5470238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5470314Z ) 2025-05-07T20:32:54.5470578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5470675Z def test_silu_mul_quant( 2025-05-07T20:32:54.5470752Z self, 2025-05-07T20:32:54.5470837Z T: int, 2025-05-07T20:32:54.5470917Z D: int, 2025-05-07T20:32:54.5471015Z scale_ub: Optional[float], 2025-05-07T20:32:54.5471113Z contiguous: bool, 2025-05-07T20:32:54.5471200Z compiled: bool, 2025-05-07T20:32:54.5471278Z ) -> None: 2025-05-07T20:32:54.5471381Z torch.manual_seed(2025) 2025-05-07T20:32:54.5471458Z 2025-05-07T20:32:54.5471627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5471706Z 2025-05-07T20:32:54.5471799Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.5473713Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5473756Z 2025-05-07T20:32:54.5473875Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.5473880Z 2025-05-07T20:32:54.5473988Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5474222Z self=, 2025-05-07T20:32:54.5474299Z T=1, 2025-05-07T20:32:54.5474382Z D=7168, 2025-05-07T20:32:54.5474468Z scale_ub=1200.0, 2025-05-07T20:32:54.5474557Z contiguous=True, 2025-05-07T20:32:54.5474651Z compiled=False, 2025-05-07T20:32:54.5474724Z ) 2025-05-07T20:32:54.5475090Z self = 2025-05-07T20:32:54.5475267Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5475271Z 2025-05-07T20:32:54.5475348Z @given( 2025-05-07T20:32:54.5475475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5475573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5475688Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5475811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5475926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5476044Z ) 2025-05-07T20:32:54.5476303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5476399Z def test_silu_mul_quant( 2025-05-07T20:32:54.5476476Z self, 2025-05-07T20:32:54.5476563Z T: int, 2025-05-07T20:32:54.5476641Z D: int, 2025-05-07T20:32:54.5476740Z scale_ub: Optional[float], 2025-05-07T20:32:54.5476837Z contiguous: bool, 2025-05-07T20:32:54.5476923Z compiled: bool, 2025-05-07T20:32:54.5477005Z ) -> None: 2025-05-07T20:32:54.5477099Z torch.manual_seed(2025) 2025-05-07T20:32:54.5477173Z 2025-05-07T20:32:54.5477348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5477429Z 2025-05-07T20:32:54.5477521Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5477651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5477740Z x = x_sign * x_clamp 2025-05-07T20:32:54.5477823Z x0 = x[:, :D] 2025-05-07T20:32:54.5477909Z x1 = x[:, D:] 2025-05-07T20:32:54.5477981Z 2025-05-07T20:32:54.5478065Z if contiguous: 2025-05-07T20:32:54.5478162Z x0 = x0.contiguous() 2025-05-07T20:32:54.5478255Z x1 = x1.contiguous() 2025-05-07T20:32:54.5478332Z 2025-05-07T20:32:54.5478423Z if scale_ub is not None: 2025-05-07T20:32:54.5478543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5478707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5478799Z ) 2025-05-07T20:32:54.5478876Z else: 2025-05-07T20:32:54.5478977Z scale_ub_tensor = None 2025-05-07T20:32:54.5479049Z 2025-05-07T20:32:54.5479180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5479279Z op = silu_mul_quant 2025-05-07T20:32:54.5479364Z if compiled: 2025-05-07T20:32:54.5479463Z op = torch.compile(op) 2025-05-07T20:32:54.5479578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5479650Z 2025-05-07T20:32:54.5479748Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5479752Z 2025-05-07T20:32:54.5479850Z moe/activation_test.py:117: 2025-05-07T20:32:54.5480030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5480139Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5480240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5480762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5480864Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5481237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5481514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5481868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5481965Z kernel = self.compile( 2025-05-07T20:32:54.5482369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5482548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5482719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5482731Z 2025-05-07T20:32:54.5482942Z self = 2025-05-07T20:32:54.5483744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5484275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871da2520>} 2025-05-07T20:32:54.5485094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5485294Z context = 2025-05-07T20:32:54.5485298Z 2025-05-07T20:32:54.5485469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5485740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5485855Z module_map=module_map) 2025-05-07T20:32:54.5486019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5486120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5486200Z E ^ 2025-05-07T20:32:54.5486566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5486574Z 2025-05-07T20:32:54.5487009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5487016Z 2025-05-07T20:32:54.5487119Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5487349Z self=, 2025-05-07T20:32:54.5487438Z T=128, 2025-05-07T20:32:54.5487515Z D=5120, 2025-05-07T20:32:54.5487605Z scale_ub=None, 2025-05-07T20:32:54.5487690Z contiguous=True, 2025-05-07T20:32:54.5487776Z compiled=False, 2025-05-07T20:32:54.5487856Z ) 2025-05-07T20:32:54.5488081Z self = 2025-05-07T20:32:54.5488256Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5488260Z 2025-05-07T20:32:54.5488344Z @given( 2025-05-07T20:32:54.5488464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5488566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5488688Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5488851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5488978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5489057Z ) 2025-05-07T20:32:54.5489356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5489456Z def test_silu_mul_quant( 2025-05-07T20:32:54.5489533Z self, 2025-05-07T20:32:54.5489611Z T: int, 2025-05-07T20:32:54.5489696Z D: int, 2025-05-07T20:32:54.5489794Z scale_ub: Optional[float], 2025-05-07T20:32:54.5489949Z contiguous: bool, 2025-05-07T20:32:54.5490047Z compiled: bool, 2025-05-07T20:32:54.5490125Z ) -> None: 2025-05-07T20:32:54.5490219Z torch.manual_seed(2025) 2025-05-07T20:32:54.5490298Z 2025-05-07T20:32:54.5490473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5490553Z 2025-05-07T20:32:54.5490646Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5490770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5490869Z x = x_sign * x_clamp 2025-05-07T20:32:54.5490949Z x0 = x[:, :D] 2025-05-07T20:32:54.5491030Z x1 = x[:, D:] 2025-05-07T20:32:54.5491147Z 2025-05-07T20:32:54.5491233Z if contiguous: 2025-05-07T20:32:54.5491324Z x0 = x0.contiguous() 2025-05-07T20:32:54.5491420Z x1 = x1.contiguous() 2025-05-07T20:32:54.5491490Z 2025-05-07T20:32:54.5491580Z if scale_ub is not None: 2025-05-07T20:32:54.5491690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5491913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5491990Z ) 2025-05-07T20:32:54.5492073Z else: 2025-05-07T20:32:54.5492168Z scale_ub_tensor = None 2025-05-07T20:32:54.5492292Z 2025-05-07T20:32:54.5492423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5492513Z op = silu_mul_quant 2025-05-07T20:32:54.5492603Z if compiled: 2025-05-07T20:32:54.5492704Z op = torch.compile(op) 2025-05-07T20:32:54.5492810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5492888Z 2025-05-07T20:32:54.5492982Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5492987Z 2025-05-07T20:32:54.5493083Z moe/activation_test.py:117: 2025-05-07T20:32:54.5493220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5493319Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5493428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5493942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5494038Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5494416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5494645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5494996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5495102Z kernel = self.compile( 2025-05-07T20:32:54.5495499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5495683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5495812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5495819Z 2025-05-07T20:32:54.5496027Z self = 2025-05-07T20:32:54.5496840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5497411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871da3420>} 2025-05-07T20:32:54.5498200Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5498394Z context = 2025-05-07T20:32:54.5498437Z 2025-05-07T20:32:54.5498613Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5498884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5498995Z module_map=module_map) 2025-05-07T20:32:54.5499167Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5499266Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5499345Z E ^ 2025-05-07T20:32:54.5499720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5499725Z 2025-05-07T20:32:54.5500192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5500197Z 2025-05-07T20:32:54.5500307Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5500536Z self=, 2025-05-07T20:32:54.5500615Z T=128, 2025-05-07T20:32:54.5500699Z D=7168, 2025-05-07T20:32:54.5500781Z scale_ub=None, 2025-05-07T20:32:54.5500866Z contiguous=True, 2025-05-07T20:32:54.5500955Z compiled=False, 2025-05-07T20:32:54.5501072Z ) 2025-05-07T20:32:54.5501298Z self = 2025-05-07T20:32:54.5501477Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5501482Z 2025-05-07T20:32:54.5501562Z @given( 2025-05-07T20:32:54.5501688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5501792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5501906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5502030Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5502147Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5502221Z ) 2025-05-07T20:32:54.5502481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5502574Z def test_silu_mul_quant( 2025-05-07T20:32:54.5502655Z self, 2025-05-07T20:32:54.5502733Z T: int, 2025-05-07T20:32:54.5502813Z D: int, 2025-05-07T20:32:54.5502920Z scale_ub: Optional[float], 2025-05-07T20:32:54.5503010Z contiguous: bool, 2025-05-07T20:32:54.5503096Z compiled: bool, 2025-05-07T20:32:54.5503179Z ) -> None: 2025-05-07T20:32:54.5503277Z torch.manual_seed(2025) 2025-05-07T20:32:54.5503350Z 2025-05-07T20:32:54.5503526Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5503603Z 2025-05-07T20:32:54.5503695Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5503827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5503916Z x = x_sign * x_clamp 2025-05-07T20:32:54.5503997Z x0 = x[:, :D] 2025-05-07T20:32:54.5504085Z x1 = x[:, D:] 2025-05-07T20:32:54.5504159Z 2025-05-07T20:32:54.5504250Z if contiguous: 2025-05-07T20:32:54.5504341Z x0 = x0.contiguous() 2025-05-07T20:32:54.5504430Z x1 = x1.contiguous() 2025-05-07T20:32:54.5504512Z 2025-05-07T20:32:54.5504604Z if scale_ub is not None: 2025-05-07T20:32:54.5504709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5504852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5504927Z ) 2025-05-07T20:32:54.5505056Z else: 2025-05-07T20:32:54.5505160Z scale_ub_tensor = None 2025-05-07T20:32:54.5505233Z 2025-05-07T20:32:54.5505366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5505462Z op = silu_mul_quant 2025-05-07T20:32:54.5505546Z if compiled: 2025-05-07T20:32:54.5505653Z op = torch.compile(op) 2025-05-07T20:32:54.5505758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5505829Z 2025-05-07T20:32:54.5505969Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5505973Z 2025-05-07T20:32:54.5506069Z moe/activation_test.py:117: 2025-05-07T20:32:54.5506517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5506635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5506736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5507257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5507360Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5507866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5508101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5508482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5508592Z kernel = self.compile( 2025-05-07T20:32:54.5509006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5509184Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5509387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5509392Z 2025-05-07T20:32:54.5509604Z self = 2025-05-07T20:32:54.5510415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5510940Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd8719484a0>} 2025-05-07T20:32:54.5511718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5511918Z context = 2025-05-07T20:32:54.5511922Z 2025-05-07T20:32:54.5512089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5512363Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5512475Z module_map=module_map) 2025-05-07T20:32:54.5512642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5512746Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5512825Z E ^ 2025-05-07T20:32:54.5513191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5513198Z 2025-05-07T20:32:54.5513632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5513637Z 2025-05-07T20:32:54.5513740Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5513977Z self=, 2025-05-07T20:32:54.5514053Z T=2048, 2025-05-07T20:32:54.5514129Z D=7168, 2025-05-07T20:32:54.5514217Z scale_ub=1200.0, 2025-05-07T20:32:54.5514376Z contiguous=True, 2025-05-07T20:32:54.5514463Z compiled=False, 2025-05-07T20:32:54.5514542Z ) 2025-05-07T20:32:54.5514768Z self = 2025-05-07T20:32:54.5514947Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5514951Z 2025-05-07T20:32:54.5515034Z @given( 2025-05-07T20:32:54.5515155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5515253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5515444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5515560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5515679Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5515758Z ) 2025-05-07T20:32:54.5516011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5516111Z def test_silu_mul_quant( 2025-05-07T20:32:54.5516194Z self, 2025-05-07T20:32:54.5516272Z T: int, 2025-05-07T20:32:54.5516354Z D: int, 2025-05-07T20:32:54.5516450Z scale_ub: Optional[float], 2025-05-07T20:32:54.5516583Z contiguous: bool, 2025-05-07T20:32:54.5516677Z compiled: bool, 2025-05-07T20:32:54.5516775Z ) -> None: 2025-05-07T20:32:54.5516877Z torch.manual_seed(2025) 2025-05-07T20:32:54.5516951Z 2025-05-07T20:32:54.5517123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5519024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5519075Z 2025-05-07T20:32:54.5524646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5524657Z 2025-05-07T20:32:54.5524788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5525022Z self=, 2025-05-07T20:32:54.5525106Z T=1, 2025-05-07T20:32:54.5525184Z D=5120, 2025-05-07T20:32:54.5525269Z scale_ub=1200.0, 2025-05-07T20:32:54.5525364Z contiguous=True, 2025-05-07T20:32:54.5525450Z compiled=False, 2025-05-07T20:32:54.5525522Z ) 2025-05-07T20:32:54.5525756Z self = 2025-05-07T20:32:54.5525934Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5525938Z 2025-05-07T20:32:54.5526016Z @given( 2025-05-07T20:32:54.5526143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5526247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5526370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5526490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5526606Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5526688Z ) 2025-05-07T20:32:54.5526942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5527039Z def test_silu_mul_quant( 2025-05-07T20:32:54.5527124Z self, 2025-05-07T20:32:54.5527202Z T: int, 2025-05-07T20:32:54.5527281Z D: int, 2025-05-07T20:32:54.5527387Z scale_ub: Optional[float], 2025-05-07T20:32:54.5527478Z contiguous: bool, 2025-05-07T20:32:54.5527568Z compiled: bool, 2025-05-07T20:32:54.5527657Z ) -> None: 2025-05-07T20:32:54.5527753Z torch.manual_seed(2025) 2025-05-07T20:32:54.5527831Z 2025-05-07T20:32:54.5528114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5528191Z 2025-05-07T20:32:54.5528292Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5528427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5528519Z x = x_sign * x_clamp 2025-05-07T20:32:54.5528608Z x0 = x[:, :D] 2025-05-07T20:32:54.5528690Z x1 = x[:, D:] 2025-05-07T20:32:54.5528763Z 2025-05-07T20:32:54.5528857Z if contiguous: 2025-05-07T20:32:54.5528950Z x0 = x0.contiguous() 2025-05-07T20:32:54.5529084Z x1 = x1.contiguous() 2025-05-07T20:32:54.5529164Z 2025-05-07T20:32:54.5529258Z if scale_ub is not None: 2025-05-07T20:32:54.5529365Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5529514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5529593Z ) 2025-05-07T20:32:54.5529676Z else: 2025-05-07T20:32:54.5529771Z scale_ub_tensor = None 2025-05-07T20:32:54.5529848Z 2025-05-07T20:32:54.5529990Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5530081Z op = silu_mul_quant 2025-05-07T20:32:54.5530214Z if compiled: 2025-05-07T20:32:54.5530323Z op = torch.compile(op) 2025-05-07T20:32:54.5530429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5530503Z 2025-05-07T20:32:54.5530602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5530607Z 2025-05-07T20:32:54.5530705Z moe/activation_test.py:117: 2025-05-07T20:32:54.5530853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5530961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5531065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5531637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5531740Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5532192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.5532434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5532790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5532894Z kernel = self.compile( 2025-05-07T20:32:54.5533292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5533476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5533615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5533622Z 2025-05-07T20:32:54.5533835Z self = 2025-05-07T20:32:54.5534654Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5535176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871949a80>} 2025-05-07T20:32:54.5535948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5536149Z context = 2025-05-07T20:32:54.5536153Z 2025-05-07T20:32:54.5536324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5536601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5536754Z module_map=module_map) 2025-05-07T20:32:54.5536920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5537024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5537103Z E ^ 2025-05-07T20:32:54.5537470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5537481Z 2025-05-07T20:32:54.5537907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5537952Z 2025-05-07T20:32:54.5538056Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5538318Z self=, 2025-05-07T20:32:54.5538405Z T=2048, 2025-05-07T20:32:54.5538497Z D=5120, 2025-05-07T20:32:54.5538585Z scale_ub=None, 2025-05-07T20:32:54.5538670Z contiguous=True, 2025-05-07T20:32:54.5538754Z compiled=False, 2025-05-07T20:32:54.5538832Z ) 2025-05-07T20:32:54.5539057Z self = 2025-05-07T20:32:54.5539242Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5539288Z 2025-05-07T20:32:54.5539367Z @given( 2025-05-07T20:32:54.5539487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5539593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5539708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5539826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5539946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5540020Z ) 2025-05-07T20:32:54.5540278Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5540413Z def test_silu_mul_quant( 2025-05-07T20:32:54.5540490Z self, 2025-05-07T20:32:54.5540571Z T: int, 2025-05-07T20:32:54.5540646Z D: int, 2025-05-07T20:32:54.5540745Z scale_ub: Optional[float], 2025-05-07T20:32:54.5540838Z contiguous: bool, 2025-05-07T20:32:54.5540924Z compiled: bool, 2025-05-07T20:32:54.5541004Z ) -> None: 2025-05-07T20:32:54.5541104Z torch.manual_seed(2025) 2025-05-07T20:32:54.5541176Z 2025-05-07T20:32:54.5541345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5541425Z 2025-05-07T20:32:54.5541518Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.5543386Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5543398Z 2025-05-07T20:32:54.5543518Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.5543523Z 2025-05-07T20:32:54.5543634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5543864Z self=, 2025-05-07T20:32:54.5543942Z T=16384, 2025-05-07T20:32:54.5544023Z D=5120, 2025-05-07T20:32:54.5544105Z scale_ub=None, 2025-05-07T20:32:54.5544195Z contiguous=True, 2025-05-07T20:32:54.5544289Z compiled=False, 2025-05-07T20:32:54.5544361Z ) 2025-05-07T20:32:54.5544584Z self = 2025-05-07T20:32:54.5544771Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5544778Z 2025-05-07T20:32:54.5544855Z @given( 2025-05-07T20:32:54.5544982Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5545131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5545249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5545375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5545489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5545564Z ) 2025-05-07T20:32:54.5545824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5545919Z def test_silu_mul_quant( 2025-05-07T20:32:54.5545996Z self, 2025-05-07T20:32:54.5546120Z T: int, 2025-05-07T20:32:54.5546197Z D: int, 2025-05-07T20:32:54.5546295Z scale_ub: Optional[float], 2025-05-07T20:32:54.5546393Z contiguous: bool, 2025-05-07T20:32:54.5546479Z compiled: bool, 2025-05-07T20:32:54.5546567Z ) -> None: 2025-05-07T20:32:54.5546662Z torch.manual_seed(2025) 2025-05-07T20:32:54.5546734Z 2025-05-07T20:32:54.5546910Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5548810Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5548818Z 2025-05-07T20:32:54.5548962Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5548967Z 2025-05-07T20:32:54.5549134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5549372Z self=, 2025-05-07T20:32:54.5549456Z T=4096, 2025-05-07T20:32:54.5549532Z D=5120, 2025-05-07T20:32:54.5549621Z scale_ub=None, 2025-05-07T20:32:54.5549713Z contiguous=True, 2025-05-07T20:32:54.5549805Z compiled=False, 2025-05-07T20:32:54.5549885Z ) 2025-05-07T20:32:54.5550110Z self = 2025-05-07T20:32:54.5550286Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5550291Z 2025-05-07T20:32:54.5550377Z @given( 2025-05-07T20:32:54.5550495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5550597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5550720Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5550836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5550958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5551033Z ) 2025-05-07T20:32:54.5551285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5551387Z def test_silu_mul_quant( 2025-05-07T20:32:54.5551467Z self, 2025-05-07T20:32:54.5551543Z T: int, 2025-05-07T20:32:54.5551625Z D: int, 2025-05-07T20:32:54.5551724Z scale_ub: Optional[float], 2025-05-07T20:32:54.5551813Z contiguous: bool, 2025-05-07T20:32:54.5551904Z compiled: bool, 2025-05-07T20:32:54.5551981Z ) -> None: 2025-05-07T20:32:54.5552073Z torch.manual_seed(2025) 2025-05-07T20:32:54.5552151Z 2025-05-07T20:32:54.5552320Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5554229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5554238Z 2025-05-07T20:32:54.5554360Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5554365Z 2025-05-07T20:32:54.5554472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5554699Z self=, 2025-05-07T20:32:54.5554776Z T=2048, 2025-05-07T20:32:54.5554858Z D=5120, 2025-05-07T20:32:54.5554983Z scale_ub=None, 2025-05-07T20:32:54.5555068Z contiguous=False, 2025-05-07T20:32:54.5555158Z compiled=False, 2025-05-07T20:32:54.5555230Z ) 2025-05-07T20:32:54.5555452Z self = 2025-05-07T20:32:54.5555638Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5555645Z 2025-05-07T20:32:54.5555723Z @given( 2025-05-07T20:32:54.5555848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5555949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5556062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5556227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5556345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5556419Z ) 2025-05-07T20:32:54.5556678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5556775Z def test_silu_mul_quant( 2025-05-07T20:32:54.5556851Z self, 2025-05-07T20:32:54.5556931Z T: int, 2025-05-07T20:32:54.5557008Z D: int, 2025-05-07T20:32:54.5557107Z scale_ub: Optional[float], 2025-05-07T20:32:54.5557267Z contiguous: bool, 2025-05-07T20:32:54.5557353Z compiled: bool, 2025-05-07T20:32:54.5557436Z ) -> None: 2025-05-07T20:32:54.5557531Z torch.manual_seed(2025) 2025-05-07T20:32:54.5557604Z 2025-05-07T20:32:54.5557780Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5559631Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5559638Z 2025-05-07T20:32:54.5559761Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5559768Z 2025-05-07T20:32:54.5559871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5560097Z self=, 2025-05-07T20:32:54.5560183Z T=4096, 2025-05-07T20:32:54.5560259Z D=7168, 2025-05-07T20:32:54.5560345Z scale_ub=None, 2025-05-07T20:32:54.5560435Z contiguous=True, 2025-05-07T20:32:54.5560520Z compiled=True, 2025-05-07T20:32:54.5560599Z ) 2025-05-07T20:32:54.5560822Z self = 2025-05-07T20:32:54.5560993Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.5560998Z 2025-05-07T20:32:54.5561082Z @given( 2025-05-07T20:32:54.5561202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5561301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5561421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5561540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5561653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5561738Z ) 2025-05-07T20:32:54.5562034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5562136Z def test_silu_mul_quant( 2025-05-07T20:32:54.5562213Z self, 2025-05-07T20:32:54.5562292Z T: int, 2025-05-07T20:32:54.5562380Z D: int, 2025-05-07T20:32:54.5562479Z scale_ub: Optional[float], 2025-05-07T20:32:54.5562569Z contiguous: bool, 2025-05-07T20:32:54.5562663Z compiled: bool, 2025-05-07T20:32:54.5562742Z ) -> None: 2025-05-07T20:32:54.5562838Z torch.manual_seed(2025) 2025-05-07T20:32:54.5562961Z 2025-05-07T20:32:54.5563131Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5565030Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5565039Z 2025-05-07T20:32:54.5565158Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5565162Z 2025-05-07T20:32:54.5565270Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5565497Z self=, 2025-05-07T20:32:54.5565577Z T=2048, 2025-05-07T20:32:54.5565666Z D=5120, 2025-05-07T20:32:54.5565750Z scale_ub=1200.0, 2025-05-07T20:32:54.5565836Z contiguous=False, 2025-05-07T20:32:54.5565927Z compiled=False, 2025-05-07T20:32:54.5566045Z ) 2025-05-07T20:32:54.5566268Z self = 2025-05-07T20:32:54.5566456Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5566461Z 2025-05-07T20:32:54.5566541Z @given( 2025-05-07T20:32:54.5566667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5566765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5566881Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5567005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5567118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5567195Z ) 2025-05-07T20:32:54.5567453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5567548Z def test_silu_mul_quant( 2025-05-07T20:32:54.5567623Z self, 2025-05-07T20:32:54.5567704Z T: int, 2025-05-07T20:32:54.5567779Z D: int, 2025-05-07T20:32:54.5567878Z scale_ub: Optional[float], 2025-05-07T20:32:54.5567971Z contiguous: bool, 2025-05-07T20:32:54.5568056Z compiled: bool, 2025-05-07T20:32:54.5568138Z ) -> None: 2025-05-07T20:32:54.5568237Z torch.manual_seed(2025) 2025-05-07T20:32:54.5568310Z 2025-05-07T20:32:54.5568484Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5570331Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5570341Z 2025-05-07T20:32:54.5570464Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5570469Z 2025-05-07T20:32:54.5570569Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5570845Z self=, 2025-05-07T20:32:54.5570930Z T=4096, 2025-05-07T20:32:54.5571006Z D=7168, 2025-05-07T20:32:54.5571091Z scale_ub=1200.0, 2025-05-07T20:32:54.5571185Z contiguous=True, 2025-05-07T20:32:54.5571269Z compiled=False, 2025-05-07T20:32:54.5571348Z ) 2025-05-07T20:32:54.5571571Z self = 2025-05-07T20:32:54.5571748Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5571800Z 2025-05-07T20:32:54.5571957Z @given( 2025-05-07T20:32:54.5572076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5572174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5572296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5572412Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5572530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5572605Z ) 2025-05-07T20:32:54.5572860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5572959Z def test_silu_mul_quant( 2025-05-07T20:32:54.5573078Z self, 2025-05-07T20:32:54.5573157Z T: int, 2025-05-07T20:32:54.5573239Z D: int, 2025-05-07T20:32:54.5573336Z scale_ub: Optional[float], 2025-05-07T20:32:54.5573423Z contiguous: bool, 2025-05-07T20:32:54.5573517Z compiled: bool, 2025-05-07T20:32:54.5573594Z ) -> None: 2025-05-07T20:32:54.5573690Z torch.manual_seed(2025) 2025-05-07T20:32:54.5573770Z 2025-05-07T20:32:54.5573941Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5575796Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5575843Z 2025-05-07T20:32:54.5575967Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5575971Z 2025-05-07T20:32:54.5576079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5576310Z self=, 2025-05-07T20:32:54.5576391Z T=16384, 2025-05-07T20:32:54.5576474Z D=7168, 2025-05-07T20:32:54.5576556Z scale_ub=None, 2025-05-07T20:32:54.5576643Z contiguous=False, 2025-05-07T20:32:54.5576736Z compiled=True, 2025-05-07T20:32:54.5576809Z ) 2025-05-07T20:32:54.5577030Z self = 2025-05-07T20:32:54.5577219Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5577224Z 2025-05-07T20:32:54.5577300Z @given( 2025-05-07T20:32:54.5577430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5577528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5577641Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5577763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5577876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5577951Z ) 2025-05-07T20:32:54.5578211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5578305Z def test_silu_mul_quant( 2025-05-07T20:32:54.5578381Z self, 2025-05-07T20:32:54.5578467Z T: int, 2025-05-07T20:32:54.5578543Z D: int, 2025-05-07T20:32:54.5578640Z scale_ub: Optional[float], 2025-05-07T20:32:54.5578736Z contiguous: bool, 2025-05-07T20:32:54.5578821Z compiled: bool, 2025-05-07T20:32:54.5578952Z ) -> None: 2025-05-07T20:32:54.5579048Z torch.manual_seed(2025) 2025-05-07T20:32:54.5579125Z 2025-05-07T20:32:54.5579303Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5581149Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5581196Z 2025-05-07T20:32:54.5581324Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5581328Z 2025-05-07T20:32:54.5581432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5581659Z self=, 2025-05-07T20:32:54.5581742Z T=4096, 2025-05-07T20:32:54.5581870Z D=7168, 2025-05-07T20:32:54.5581953Z scale_ub=None, 2025-05-07T20:32:54.5582046Z contiguous=True, 2025-05-07T20:32:54.5582133Z compiled=False, 2025-05-07T20:32:54.5582214Z ) 2025-05-07T20:32:54.5582436Z self = 2025-05-07T20:32:54.5582617Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5582622Z 2025-05-07T20:32:54.5582705Z @given( 2025-05-07T20:32:54.5582824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5582964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5583086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5583202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5583326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5583401Z ) 2025-05-07T20:32:54.5583651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5583755Z def test_silu_mul_quant( 2025-05-07T20:32:54.5583831Z self, 2025-05-07T20:32:54.5583907Z T: int, 2025-05-07T20:32:54.5583989Z D: int, 2025-05-07T20:32:54.5584086Z scale_ub: Optional[float], 2025-05-07T20:32:54.5584172Z contiguous: bool, 2025-05-07T20:32:54.5584266Z compiled: bool, 2025-05-07T20:32:54.5584344Z ) -> None: 2025-05-07T20:32:54.5584439Z torch.manual_seed(2025) 2025-05-07T20:32:54.5584519Z 2025-05-07T20:32:54.5584686Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5586545Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5586551Z 2025-05-07T20:32:54.5586668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5586673Z 2025-05-07T20:32:54.5586781Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5587008Z self=, 2025-05-07T20:32:54.5587085Z T=16384, 2025-05-07T20:32:54.5587167Z D=7168, 2025-05-07T20:32:54.5587262Z scale_ub=None, 2025-05-07T20:32:54.5587350Z contiguous=True, 2025-05-07T20:32:54.5587443Z compiled=False, 2025-05-07T20:32:54.5587515Z ) 2025-05-07T20:32:54.5587785Z self = 2025-05-07T20:32:54.5587976Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5587980Z 2025-05-07T20:32:54.5588060Z @given( 2025-05-07T20:32:54.5588182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5588287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5588400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5588517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5588676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5588751Z ) 2025-05-07T20:32:54.5589009Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5589107Z def test_silu_mul_quant( 2025-05-07T20:32:54.5589185Z self, 2025-05-07T20:32:54.5589268Z T: int, 2025-05-07T20:32:54.5589346Z D: int, 2025-05-07T20:32:54.5589444Z scale_ub: Optional[float], 2025-05-07T20:32:54.5589544Z contiguous: bool, 2025-05-07T20:32:54.5589631Z compiled: bool, 2025-05-07T20:32:54.5589708Z ) -> None: 2025-05-07T20:32:54.5589808Z torch.manual_seed(2025) 2025-05-07T20:32:54.5589949Z 2025-05-07T20:32:54.5590120Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5591975Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5592021Z 2025-05-07T20:32:54.5592141Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5592155Z 2025-05-07T20:32:54.5592257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5592484Z self=, 2025-05-07T20:32:54.5592569Z T=16384, 2025-05-07T20:32:54.5592648Z D=7168, 2025-05-07T20:32:54.5592732Z scale_ub=1200.0, 2025-05-07T20:32:54.5592827Z contiguous=True, 2025-05-07T20:32:54.5592911Z compiled=False, 2025-05-07T20:32:54.5592982Z ) 2025-05-07T20:32:54.5593214Z self = 2025-05-07T20:32:54.5593399Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5593404Z 2025-05-07T20:32:54.5593485Z @given( 2025-05-07T20:32:54.5593608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5593705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5593826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5593944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5594059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5594137Z ) 2025-05-07T20:32:54.5594391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5594485Z def test_silu_mul_quant( 2025-05-07T20:32:54.5594567Z self, 2025-05-07T20:32:54.5594642Z T: int, 2025-05-07T20:32:54.5594717Z D: int, 2025-05-07T20:32:54.5594823Z scale_ub: Optional[float], 2025-05-07T20:32:54.5594916Z contiguous: bool, 2025-05-07T20:32:54.5595006Z compiled: bool, 2025-05-07T20:32:54.5595082Z ) -> None: 2025-05-07T20:32:54.5595177Z torch.manual_seed(2025) 2025-05-07T20:32:54.5595257Z 2025-05-07T20:32:54.5595425Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5597323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5597336Z 2025-05-07T20:32:54.5597454Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5597495Z 2025-05-07T20:32:54.5597597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5597830Z self=, 2025-05-07T20:32:54.5597914Z T=128, 2025-05-07T20:32:54.5597990Z D=5120, 2025-05-07T20:32:54.5598083Z scale_ub=1200.0, 2025-05-07T20:32:54.5598169Z contiguous=False, 2025-05-07T20:32:54.5598260Z compiled=False, 2025-05-07T20:32:54.5598337Z ) 2025-05-07T20:32:54.5598561Z self = 2025-05-07T20:32:54.5598787Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5598793Z 2025-05-07T20:32:54.5598871Z @given( 2025-05-07T20:32:54.5598990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5599094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5599207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5599324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5599443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5599515Z ) 2025-05-07T20:32:54.5599772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5599906Z def test_silu_mul_quant( 2025-05-07T20:32:54.5599981Z self, 2025-05-07T20:32:54.5600062Z T: int, 2025-05-07T20:32:54.5600138Z D: int, 2025-05-07T20:32:54.5600238Z scale_ub: Optional[float], 2025-05-07T20:32:54.5600334Z contiguous: bool, 2025-05-07T20:32:54.5600418Z compiled: bool, 2025-05-07T20:32:54.5600498Z ) -> None: 2025-05-07T20:32:54.5600597Z torch.manual_seed(2025) 2025-05-07T20:32:54.5600670Z 2025-05-07T20:32:54.5600837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5600921Z 2025-05-07T20:32:54.5601012Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5601147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5601236Z x = x_sign * x_clamp 2025-05-07T20:32:54.5601314Z x0 = x[:, :D] 2025-05-07T20:32:54.5601403Z x1 = x[:, D:] 2025-05-07T20:32:54.5601478Z 2025-05-07T20:32:54.5601562Z if contiguous: 2025-05-07T20:32:54.5601662Z x0 = x0.contiguous() 2025-05-07T20:32:54.5601752Z x1 = x1.contiguous() 2025-05-07T20:32:54.5601825Z 2025-05-07T20:32:54.5601922Z if scale_ub is not None: 2025-05-07T20:32:54.5602029Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5602166Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5602250Z ) 2025-05-07T20:32:54.5602327Z else: 2025-05-07T20:32:54.5602421Z scale_ub_tensor = None 2025-05-07T20:32:54.5602497Z 2025-05-07T20:32:54.5602627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5602724Z op = silu_mul_quant 2025-05-07T20:32:54.5602810Z if compiled: 2025-05-07T20:32:54.5602913Z op = torch.compile(op) 2025-05-07T20:32:54.5603025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5603098Z 2025-05-07T20:32:54.5603193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5603197Z 2025-05-07T20:32:54.5603300Z moe/activation_test.py:117: 2025-05-07T20:32:54.5603433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5603586Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5603693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5604215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5604318Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5604689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.5604921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5605322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5605420Z kernel = self.compile( 2025-05-07T20:32:54.5605822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5606004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5606390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5606396Z 2025-05-07T20:32:54.5606769Z self = 2025-05-07T20:32:54.5607577Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5608109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd8718307c0>} 2025-05-07T20:32:54.5608910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5609196Z context = 2025-05-07T20:32:54.5609201Z 2025-05-07T20:32:54.5609376Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5609654Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5609766Z module_map=module_map) 2025-05-07T20:32:54.5609931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5610030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5610116Z E ^ 2025-05-07T20:32:54.5610484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5610489Z 2025-05-07T20:32:54.5610921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5610931Z 2025-05-07T20:32:54.5611035Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5611266Z self=, 2025-05-07T20:32:54.5611354Z T=2048, 2025-05-07T20:32:54.5611430Z D=7168, 2025-05-07T20:32:54.5611515Z scale_ub=None, 2025-05-07T20:32:54.5611606Z contiguous=False, 2025-05-07T20:32:54.5611690Z compiled=False, 2025-05-07T20:32:54.5611762Z ) 2025-05-07T20:32:54.5612044Z self = 2025-05-07T20:32:54.5612223Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5612231Z 2025-05-07T20:32:54.5612317Z @given( 2025-05-07T20:32:54.5612436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5612533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5612657Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5612775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5612889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5613041Z ) 2025-05-07T20:32:54.5613295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5613390Z def test_silu_mul_quant( 2025-05-07T20:32:54.5613476Z self, 2025-05-07T20:32:54.5613551Z T: int, 2025-05-07T20:32:54.5613628Z D: int, 2025-05-07T20:32:54.5613732Z scale_ub: Optional[float], 2025-05-07T20:32:54.5613822Z contiguous: bool, 2025-05-07T20:32:54.5613913Z compiled: bool, 2025-05-07T20:32:54.5614155Z ) -> None: 2025-05-07T20:32:54.5614249Z torch.manual_seed(2025) 2025-05-07T20:32:54.5614329Z 2025-05-07T20:32:54.5614498Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5616405Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5616418Z 2025-05-07T20:32:54.5616538Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5616542Z 2025-05-07T20:32:54.5616644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5616878Z self=, 2025-05-07T20:32:54.5616956Z T=128, 2025-05-07T20:32:54.5617035Z D=7168, 2025-05-07T20:32:54.5617128Z scale_ub=1200.0, 2025-05-07T20:32:54.5617253Z contiguous=True, 2025-05-07T20:32:54.5617342Z compiled=True, 2025-05-07T20:32:54.5617416Z ) 2025-05-07T20:32:54.5617638Z self = 2025-05-07T20:32:54.5617818Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5617823Z 2025-05-07T20:32:54.5617901Z @given( 2025-05-07T20:32:54.5618020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5618125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5618239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5618356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5618501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5618584Z ) 2025-05-07T20:32:54.5618864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5618962Z def test_silu_mul_quant( 2025-05-07T20:32:54.5619041Z self, 2025-05-07T20:32:54.5619124Z T: int, 2025-05-07T20:32:54.5619201Z D: int, 2025-05-07T20:32:54.5619300Z scale_ub: Optional[float], 2025-05-07T20:32:54.5619400Z contiguous: bool, 2025-05-07T20:32:54.5619488Z compiled: bool, 2025-05-07T20:32:54.5619570Z ) -> None: 2025-05-07T20:32:54.5619673Z torch.manual_seed(2025) 2025-05-07T20:32:54.5619745Z 2025-05-07T20:32:54.5619918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5620000Z 2025-05-07T20:32:54.5620093Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5620226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5620315Z x = x_sign * x_clamp 2025-05-07T20:32:54.5620396Z x0 = x[:, :D] 2025-05-07T20:32:54.5620483Z x1 = x[:, D:] 2025-05-07T20:32:54.5620556Z 2025-05-07T20:32:54.5620638Z if contiguous: 2025-05-07T20:32:54.5620736Z x0 = x0.contiguous() 2025-05-07T20:32:54.5620828Z x1 = x1.contiguous() 2025-05-07T20:32:54.5620900Z 2025-05-07T20:32:54.5620999Z if scale_ub is not None: 2025-05-07T20:32:54.5621105Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5621289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5621376Z ) 2025-05-07T20:32:54.5621452Z else: 2025-05-07T20:32:54.5621557Z scale_ub_tensor = None 2025-05-07T20:32:54.5621630Z 2025-05-07T20:32:54.5621760Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5621856Z op = silu_mul_quant 2025-05-07T20:32:54.5621943Z if compiled: 2025-05-07T20:32:54.5622042Z op = torch.compile(op) 2025-05-07T20:32:54.5622219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5622292Z 2025-05-07T20:32:54.5622382Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5622386Z 2025-05-07T20:32:54.5622486Z moe/activation_test.py:117: 2025-05-07T20:32:54.5622622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5622729Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5622829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5623214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5623313Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5623870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5623969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5624343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.5624575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5624929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5625065Z kernel = self.compile( 2025-05-07T20:32:54.5625460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5625650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5625783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5625788Z 2025-05-07T20:32:54.5625997Z self = 2025-05-07T20:32:54.5626811Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5627338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871831940>} 2025-05-07T20:32:54.5628123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5628315Z context = 2025-05-07T20:32:54.5628320Z 2025-05-07T20:32:54.5628495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5628766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5628874Z module_map=module_map) 2025-05-07T20:32:54.5629045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5629145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5629223Z E ^ 2025-05-07T20:32:54.5629598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5629605Z 2025-05-07T20:32:54.5630033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5630038Z 2025-05-07T20:32:54.5630192Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5630423Z self=, 2025-05-07T20:32:54.5630502Z T=128, 2025-05-07T20:32:54.5630585Z D=7168, 2025-05-07T20:32:54.5630668Z scale_ub=1200.0, 2025-05-07T20:32:54.5630754Z contiguous=True, 2025-05-07T20:32:54.5630845Z compiled=False, 2025-05-07T20:32:54.5630916Z ) 2025-05-07T20:32:54.5631146Z self = 2025-05-07T20:32:54.5631362Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5631366Z 2025-05-07T20:32:54.5631444Z @given( 2025-05-07T20:32:54.5631570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5631674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5631789Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5631911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5632028Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5632102Z ) 2025-05-07T20:32:54.5632400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5632496Z def test_silu_mul_quant( 2025-05-07T20:32:54.5632580Z self, 2025-05-07T20:32:54.5632656Z T: int, 2025-05-07T20:32:54.5632733Z D: int, 2025-05-07T20:32:54.5632836Z scale_ub: Optional[float], 2025-05-07T20:32:54.5632927Z contiguous: bool, 2025-05-07T20:32:54.5633014Z compiled: bool, 2025-05-07T20:32:54.5633098Z ) -> None: 2025-05-07T20:32:54.5633193Z torch.manual_seed(2025) 2025-05-07T20:32:54.5633266Z 2025-05-07T20:32:54.5633443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5633561Z 2025-05-07T20:32:54.5633654Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5633787Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5635651Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5635665Z 2025-05-07T20:32:54.5635785Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5635790Z 2025-05-07T20:32:54.5635892Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5636129Z self=, 2025-05-07T20:32:54.5636206Z T=128, 2025-05-07T20:32:54.5636284Z D=5120, 2025-05-07T20:32:54.5636376Z scale_ub=1200.0, 2025-05-07T20:32:54.5636464Z contiguous=True, 2025-05-07T20:32:54.5636548Z compiled=True, 2025-05-07T20:32:54.5636629Z ) 2025-05-07T20:32:54.5636856Z self = 2025-05-07T20:32:54.5637027Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5637041Z 2025-05-07T20:32:54.5637117Z @given( 2025-05-07T20:32:54.5637238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5637344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5637459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5637576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5637696Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5637772Z ) 2025-05-07T20:32:54.5638026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5638128Z def test_silu_mul_quant( 2025-05-07T20:32:54.5638248Z self, 2025-05-07T20:32:54.5638327Z T: int, 2025-05-07T20:32:54.5638411Z D: int, 2025-05-07T20:32:54.5638512Z scale_ub: Optional[float], 2025-05-07T20:32:54.5638610Z contiguous: bool, 2025-05-07T20:32:54.5638697Z compiled: bool, 2025-05-07T20:32:54.5638775Z ) -> None: 2025-05-07T20:32:54.5638879Z torch.manual_seed(2025) 2025-05-07T20:32:54.5638972Z 2025-05-07T20:32:54.5639169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5639297Z 2025-05-07T20:32:54.5639392Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5639515Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5641414Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5641422Z 2025-05-07T20:32:54.5641540Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5641545Z 2025-05-07T20:32:54.5641652Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5641880Z self=, 2025-05-07T20:32:54.5641966Z T=128, 2025-05-07T20:32:54.5642046Z D=7168, 2025-05-07T20:32:54.5642129Z scale_ub=None, 2025-05-07T20:32:54.5642219Z contiguous=True, 2025-05-07T20:32:54.5642341Z compiled=True, 2025-05-07T20:32:54.5642414Z ) 2025-05-07T20:32:54.5642644Z self = 2025-05-07T20:32:54.5642816Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.5642821Z 2025-05-07T20:32:54.5642898Z @given( 2025-05-07T20:32:54.5643021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5643120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5643240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5643357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5643472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5643557Z ) 2025-05-07T20:32:54.5643808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5643902Z def test_silu_mul_quant( 2025-05-07T20:32:54.5643986Z self, 2025-05-07T20:32:54.5644065Z T: int, 2025-05-07T20:32:54.5644141Z D: int, 2025-05-07T20:32:54.5644246Z scale_ub: Optional[float], 2025-05-07T20:32:54.5644335Z contiguous: bool, 2025-05-07T20:32:54.5644421Z compiled: bool, 2025-05-07T20:32:54.5644506Z ) -> None: 2025-05-07T20:32:54.5644600Z torch.manual_seed(2025) 2025-05-07T20:32:54.5644683Z 2025-05-07T20:32:54.5644854Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5646692Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5646711Z 2025-05-07T20:32:54.5646828Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5647014Z =============================== warnings summary =============================== 2025-05-07T20:32:54.5647343Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.5647658Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.5647966Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.5648928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:54.5649203Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:54.5649210Z 2025-05-07T20:32:54.5649436Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:54.5649608Z ================= 1 failed, 1 deselected, 3 warnings in 13.16s ================= 2025-05-07T20:32:56.1518099Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:56.2147861Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:56.2148534Z 2025-05-07T20:32:56.2149195Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:56.2150077Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:56.2150665Z 2025-05-07T20:32:56.2150671Z 2025-05-07T20:32:56.2150677Z 2025-05-07T20:32:56.2168201Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:56.2250004Z Post job cleanup. 2025-05-07T20:32:56.3222282Z [command]/usr/bin/git version 2025-05-07T20:32:56.3262061Z git version 2.47.1 2025-05-07T20:32:56.3296423Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/7d47374c-685e-4702-832f-41fd22dfa44f/.gitconfig' 2025-05-07T20:32:56.3306272Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/7d47374c-685e-4702-832f-41fd22dfa44f' before making global git config changes 2025-05-07T20:32:56.3307145Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:56.3318994Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:56.3360968Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:56.3396002Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:56.3734149Z Entering 'external/asmjit' 2025-05-07T20:32:56.3799125Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.3877152Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.3944950Z Entering 'external/cutlass' 2025-05-07T20:32:56.4020129Z Entering 'external/googletest' 2025-05-07T20:32:56.4086147Z Entering 'external/hipify_torch' 2025-05-07T20:32:56.4153550Z Entering 'external/json' 2025-05-07T20:32:56.4238847Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:56.4265069Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4277956Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:56.4309956Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:56.4637364Z Entering 'external/asmjit' 2025-05-07T20:32:56.4681871Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4723252Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.4767077Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4816568Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.4861911Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4904074Z Entering 'external/cutlass' 2025-05-07T20:32:56.4947086Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4998304Z 
Entering 'external/googletest' 2025-05-07T20:32:56.5046498Z http.https://github.com/.extraheader 2025-05-07T20:32:56.5088775Z Entering 'external/hipify_torch' 2025-05-07T20:32:56.5132292Z http.https://github.com/.extraheader 2025-05-07T20:32:56.5173935Z Entering 'external/json' 2025-05-07T20:32:56.5217205Z http.https://github.com/.extraheader 2025-05-07T20:32:56.5367835Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:32:56.5404098Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:32:56.5415515Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:32:56.5415886Z ##[endgroup] 2025-05-07T20:32:56.5516149Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:07.2814744Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:23.7199544Z Cleaning up orphan processes
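
Note on the recurring CompilationError: every ValueError("type fp8e4nv not supported in this architecture") above comes from Triton refusing to compile the fp8e4nv (e4m3) dtype on this runner's GPU. Triton supports fp8e4nv natively only on compute capability 8.9 and newer (Ada/Hopper class parts); a 22.07 GiB device that offers only ('fp8e4b15', 'fp8e5') is consistent with an sm_86 part such as the A10G. A minimal sketch of a capability guard that would skip these examples on unsupported hardware follows; the helper name and test-class name are illustrative assumptions, not taken from the repo:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) kernels need compute capability >= 8.9; on sm_86
        # devices Triton only offers the fp8e4b15/fp8e5 variants, which is
        # exactly the error seen in the log above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTest(unittest.TestCase):  # hypothetical stand-in for the real test class
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...  # the silu_mul_quant example body from the log would run here

With a guard like this, the Hypothesis examples would be skipped on sm_86 runners instead of failing at Triton compile time.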
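Note on the OutOfMemoryError cascade: each failing example tries a modest allocation (20 MiB to 448 MiB) while the 22.07 GiB device is already almost entirely consumed by PyTorch (about 21.7 GiB allocated), so the failures look like memory accumulating across Hypothesis examples, likely left behind by the earlier compile failures, rather than any single oversized tensor. The allocator's own suggestion is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which only takes effect if set before CUDA is initialized. A rough sketch, under the assumption that explicitly releasing cached memory between examples would keep later examples from starving (the function name is illustrative):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then return cached allocator
        # blocks to the driver so the next example starts from a cleaner
        # device; synchronize so pending frees have actually landed.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # The allocator hint quoted in the error text must be set before the
    # first CUDA allocation, e.g. in the job's environment rather than
    # inside the already-running test process:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True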